Abstract

We introduce the idea of distortion side information, which does not directly depend on the source 
but instead affects the distortion measure. We show that such distortion side information is not only 
useful at the encoder, but that under certain conditions, knowing it at only the encoder is as good as 
knowing it at both encoder and decoder, and knowing it at only the decoder is useless. Thus distortion 
side information is a natural complement to the signal side information studied by Wyner and Ziv, which 
depends on the source but does not involve the distortion measure. Furthermore, when both types of side 
information are present, we characterize the penalty for deviating from the configuration of encoder-only 
distortion side information and decoder-only signal side information, which in many cases is as good as 
full side information knowledge. 

Keywords- distortion side information, separation principle, distributed source coding, rate loss, sensor 
networks, phase quantization 



I. Introduction 

In many large systems such as sensor networks, communication networks, distributed control, and
biological systems, different parts of the system may each have limited, noisy, or incomplete information
but must somehow cooperate. Key issues in such scenarios include the penalty incurred due to the lack 
of shared information, possible approaches for combining information from different sources, and the 
more general question of how different kinds of information can be partitioned based on the role of each 
system component. 



One example of this scenario, illustrated in Fig. 1(a), is when an observer records a signal x to be
conveyed to a receiver who also has some additional signal side information w, which is correlated with
x. As first introduced by Wyner and Ziv [4], [5] and extended by other researchers [6], [7], [8], in many
cases the observer and receiver can obtain the full benefit of the signal side information even if it is
known only by the receiver.

This work has been presented in part at the Data Compression Conference in March 2004 [1], at the International Symposium
on Information Theory in June 2004 [2], and in Emin Martinian's Ph.D. thesis [3].



[Figure 1: two block diagrams, each with an Encoder feeding a Decoder.]

(a) Signal side information w at the decoder. (b) Distortion side information q at the encoder.

Fig. 1. Compressing a source x with signal side information or distortion side information. 

In this paper, we introduce a different scenario illustrated in Fig. 1(b) where instead the observer has
some distortion side information q describing which components of the data are more important than
others, but the receiver may not have access to q. Specifically, let us model the differing importance
of different signal components by measuring the distortion between the ith source sample $x[i]$ and its
quantized value $\hat{x}[i]$ by a distortion function that depends on the side information $q[i]$:

$$d(x[i], \hat{x}[i]; q[i]). \tag{1}$$

In principle, one could treat the source-side information pair (q, x) as an "effective composite source", 
and apply conventional techniques to quantize it. Such an approach, however, ignores the different roles 
q and x play in the distortion. And as often happens in lossy compression, a good understanding of the 
distortion measure may lead to a more efficient system. Moreover, an interesting observation made in 
this paper is that in some important cases the encoder can obtain the full benefit of the distortion side 
information even if it is not known at the receiver. Hence, distortion side information at the encoder is 
a natural complement of the Wyner-Ziv setting. 

Sensor observations are one class of signals where the idea of distortion side information may be useful. 
For example, a sensor may have side information corresponding to reliability estimates for measured data 
(which may or may not be available at the receiver). This may occur if the sensor can calibrate its accuracy 
to changing conditions (e.g., the amount of light, background noise, or other interference present), if the
sensor averages data for a variety of measurements (e.g., combining results from a number of sub-sensors),
or if some external signal indicates important events (e.g., an accelerometer indicating movement).

Alternatively, certain components of the signal may be more or less sensitive to distortion due to 
masking effects or context [9]. For example, errors in audio samples following a loud sound, or errors
in pixels spatially or temporally near bright spots, are perceptually less relevant. Similarly, accurately
preserving certain edges or textures in an image or human voices in audio may be more important than 
preserving background patterns/sounds. Masking, sensitivity to context, etc., is usually a complicated 
function of the entire signal. Yet often there is no need to explicitly convey information about this 
function to the encoder. Hence, from the point of view of quantizing a given sample, it is reasonable to 
model such effects as side information which is roughly statistically independent of that sample. 

Clearly in performing data compression with distortion side information, the encoder should more 
heavily weight matching the more important data. The importance of exploiting the different sensitivities 
of the human perceptual system is widely recognized by engineers involved in the construction and 
evaluation of practical compression algorithms when distortion side information is available at both 
observer and receiver. In contrast, the value and use of distortion side information known only at either 
the encoder or decoder but not both has received relatively little attention in the information theory and 
quantizer design community. The rate-distortion function with decoder-only side information, relative to 
side information dependent distortion measures (as an extension of the Wyner-Ziv setting [4]), is given 
in [7]. A high resolution approximation for this rate-distortion function for locally quadratic weighted 
distortion measures is given in [10]. 

We are not aware of an information-theoretic treatment of encoder-only side information with such 
distortion measures. In fact, the mistaken notion that encoder-only side information is never useful is 
common folklore. This may be due to a misunderstanding of Berger's result that side information that 
does not affect the distortion measure is never useful when known only at the encoder [11], [6]. 

In this paper, we begin by studying the rate-distortion trade-off when side information about the 
distortion sensitivity is available. We show that such distortion side information can provide an arbitrarily 
large advantage (relative to no side information) even when the distortion side information is known only 
at the encoder. Furthermore, we show that just as knowledge of signal side information is often only 
required at the decoder, knowledge of distortion side information is often only required at the encoder. 
Finally, we show that these results continue to hold even when both distortion side information q and 
signal side information w are considered. Specifically, we demonstrate that a system where only the 
encoder knows q and only the decoder knows w is asymptotically as good as a system with all side 
information known everywhere. We also derive the penalty for deviating from this side information 
configuration (e.g., providing q to the decoder instead of the encoder).

We first illustrate how distortion side information can be used even when known only by the observer
with some motivating examples in Section II. Next, in Section III we precisely define a problem model
and state the relevant rate-distortion trade-offs. In Section IV we consider scenarios where encoder-only
knowledge of the distortion side information is optimal. Specifically, we show that sources that are
uniformly distributed over a group with a difference distortion measure as well as arbitrary sources with
"erasure distortion" have the property that encoder-only distortion side information is just as good as full
distortion side information. In Section V we study more general source and distortion models in the limit
of high resolution. Specifically, we show that in high resolution, knowing distortion side information at
the encoder and signal side information at the decoder is both necessary and sufficient to achieve the
performance of a fully informed system. In Section VI we consider scaled quadratic distortions in the
non-asymptotic (general resolution) regime and derive bounds on the loss with encoder-only distortion
side information and decoder-only signal side information versus full side information. These bounds
also show how quickly the high-resolution regime is approached. Finally, we close with a discussion in
Section VII followed by some concluding remarks in Section VIII. Throughout the paper, most proofs
and lengthy derivations are deferred to the appendix.

II. Motivating Examples 

A. Discrete Uniform Source 

Consider the case where the source $x[i]$ corresponds to n samples each uniformly and independently
drawn from the finite alphabet $\mathcal{X}$ with cardinality $|\mathcal{X}| > n$. Let $q[i]$ correspond to n binary variables
indicating which source samples are relevant. Specifically, let the distortion measure be of the form
$d(x, \hat{x}; q) = 0$ if and only if either $q = 0$ or $\hat{x} = x$. Finally, let the sequence $q[i]$ be statistically
independent of the source with $q[i]$ drawn uniformly from the n choose k subsets with exactly k ones.¹

If the side information were unavailable or ignored, then losslessly communicating the source would
require exactly $n \cdot \log|\mathcal{X}|$ bits. A better (though still sub-optimal) approach when encoder side information
is available would be for the encoder to first tell the decoder which samples are relevant and then send
only those samples. Using Stirling's approximation, this would require about $n \cdot H_b(k/n)$ bits (where
$H_b(\cdot)$ denotes the binary entropy function) to describe which samples are relevant plus $k \cdot \log|\mathcal{X}|$ bits to
describe the relevant source samples. Note that if the side information were also known at the decoder,
then the overhead required in telling the decoder which samples are relevant could be avoided and the
total rate required would only be $k \cdot \log|\mathcal{X}|$. This overhead can in fact be avoided even without decoder
side information.
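To make the three bit budgets above concrete, the following sketch evaluates them for small parameters. The function names and the example values of n, k, and the alphabet size are illustrative assumptions, not part of the original development; the exact index cost log2 C(n, k) is shown next to the n · H_b(k/n) approximation.

```python
from math import comb, log2

def binary_entropy(p):
    """Binary entropy H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def rate_budgets(n, k, alphabet_size):
    """Bits used by the three strategies discussed in the text (hypothetical helper)."""
    bits_per_symbol = log2(alphabet_size)
    exact_index_bits = log2(comb(n, k))          # cost of naming the relevant subset
    stirling_index_bits = n * binary_entropy(k / n)  # the n * H_b(k/n) approximation
    return {
        "ignore_q": n * bits_per_symbol,                      # send every sample
        "describe_q": exact_index_bits + k * bits_per_symbol,  # send subset, then samples
        "informed": k * bits_per_symbol,                      # decoder also knows q
        "exact_index_bits": exact_index_bits,
        "stirling_index_bits": stirling_index_bits,
    }

if __name__ == "__main__":
    for name, bits in rate_budgets(n=7, k=5, alphabet_size=8).items():
        print(f"{name}: {bits:.3f} bits")
```

The gap between "describe_q" and "informed" is the overhead that the construction in the remainder of this example removes.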

¹If the distortion side information is a Bernoulli(k/n) sequence, then there will be about k ones with high probability. We
focus on the case with exactly k ones for simplicity.
To see this, we view the source samples $x[0], x[1], \ldots, x[n-1]$ as a codeword of an (n, k) Reed-Solomon
(RS) code (or more generally any MDS² code) with $q[i] = 0$ indicating an erasure at sample i.
We use the RS decoding algorithm to "correct" the erasures and determine the k corresponding information
symbols, which are sent to the receiver. To reconstruct the signal, the receiver encodes the k information
symbols using the encoder for the (n, k) RS code to produce the reconstruction $\hat{x}[0], \hat{x}[1], \ldots, \hat{x}[n-1]$.
Only symbols with $q[i] = 0$ could have changed, hence $\hat{x}[i] = x[i]$ whenever $q[i] = 1$ and the relevant
samples are losslessly communicated using only $k \cdot \log|\mathcal{X}|$ bits.

As illustrated in Fig. 2, RS decoding can be viewed as curve-fitting and RS encoding can be viewed
as interpolation. Hence this source coding approach can be viewed as fitting a curve of degree k − 1 to the
points of $x[i]$ where $q[i] = 1$. The resulting curve can be specified using just k elements. It perfectly
reproduces $x[i]$ where $q[i] = 1$ and interpolates the remaining points.
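The curve-fitting view can be sketched in a few lines. The code below is a minimal illustration, assuming a prime field GF(P) with P = 257 and evaluation points 0, 1, ..., n − 1; these choices and all function names are assumptions made for the sketch, and a practical RS code would typically use a different field and a dedicated decoder.

```python
P = 257  # assumed prime field size; must be at least n

def interpolate(points, p=P):
    """Coefficients (low to high) of the unique polynomial of degree < len(points)
    passing through the given (position, value) pairs over GF(p)."""
    k = len(points)
    coeffs = [0] * k
    for j, (xj, yj) in enumerate(points):
        basis = [1]  # running product of (t - xm) over m != j
        denom = 1    # running product of (xj - xm) over m != j
        for m, (xm, _) in enumerate(points):
            if m == j:
                continue
            nxt = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):
                nxt[d] = (nxt[d] - xm * c) % p
                nxt[d + 1] = (nxt[d + 1] + c) % p
            basis = nxt
            denom = denom * (xj - xm) % p
        scale = yj * pow(denom, p - 2, p) % p  # inverse via Fermat's little theorem
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * c) % p
    return coeffs

def evaluate(coeffs, t, p=P):
    """Horner evaluation of the polynomial at position t over GF(p)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * t + c) % p
    return acc

def encode(x, q):
    """Fit a curve to the relevant samples (q[i] == 1) and return its k coefficients."""
    return interpolate([(i, x[i]) for i in range(len(x)) if q[i] == 1])

def decode(coeffs, n):
    """Evaluate the curve at all n positions; the decoder never needs q."""
    return [evaluate(coeffs, i) for i in range(n)]
```

With n = 7 and five relevant samples, `encode` returns five field elements and `decode` reproduces the relevant samples exactly while filling in the irrelevant positions with whatever the curve happens to interpolate.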




Fig. 2. Losslessly encoding a source with n = 7 points where only k = 5 points are relevant (i.e., the unshaded ones) can be
done by fitting a fourth degree curve to the relevant points. The resulting curve will require k elements (yielding a compression 
ratio of k/n) and will exactly reproduce the desired points. 



B. Gaussian Source 

A similar approach can be used to quantize a zero mean, unit variance, complex Gaussian source
relative to quadratic distortion using the Discrete Fourier Transform (DFT). Specifically, to encode the
source samples $x[0], x[1], \ldots, x[n-1]$, we view the k relevant samples as elements of a complex,
periodic, Gaussian sequence with period n, which is band-limited in the sense that only its first k DFT
coefficients are non-zero. Using periodic, band-limited interpolation we can use only the k samples where
$q[i] = 1$ to find the corresponding k DFT coefficients $X[0], X[1], \ldots, X[k-1]$.

²The desired MDS code always exists since we assumed $|\mathcal{X}| > n$. For $|\mathcal{X}| < n$, near MDS codes exist, which give
asymptotically similar performance with an overhead that goes to zero as $n \to \infty$.
The relationship between the k relevant source samples and the k interpolated DFT coefficients has a
number of special properties. In particular this k × k transformation is unitary and furthermore each DFT
basis vector has an equal amount of energy in each component of the original basis.³ Hence, the DFT
coefficients are Gaussian with unit variance and zero mean. Thus, the k DFT coefficients can be quantized
with average distortion D per coefficient and $k \cdot R(D)$ bits where $R(D)$ represents the rate-distortion
trade-off for the quantizer. To reconstruct the signal, the decoder simply transforms the quantized DFT
coefficients back to the time domain. Since the DFT coefficients and the relevant source samples are
related by a unitary transformation, the average error per coefficient for these source samples remains
unchanged, i.e., the error is exactly D.

Note that if the side information were unavailable or ignored, then at least $n \cdot R(D)$ bits would be required.
If the side information were losslessly sent to the decoder, then $n \cdot H_b(k/n) + k \cdot R(D)$ bits would be
required. Finally, even if the decoder had knowledge of the side information, at least $k \cdot R(D)$ bits would
be needed. Hence, the DFT scheme achieves the same performance as when the side information is
available at both the encoder and decoder, and is strictly better than ignoring the side information or
losslessly communicating it.
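The interpolation step of the DFT scheme can be sketched as follows. This is only an illustration of finding k band-limited DFT coefficients that match the relevant samples; quantization is omitted, the 1/n inverse-DFT normalization is an assumed convention, and the small linear solver and all function names are hypothetical helpers introduced for the sketch.

```python
import cmath

def solve(A, b):
    """Solve the square complex system A y = b by Gaussian elimination
    with partial pivoting."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    y = [0j] * k
    for r in range(k - 1, -1, -1):
        s = M[r][k] - sum(M[r][c] * y[c] for c in range(r + 1, k))
        y[r] = s / M[r][r]
    return y

def bandlimited_coeffs(x, q):
    """Find X[0..k-1] such that the period-n sequence whose only non-zero DFT
    coefficients are X[0..k-1] agrees with x wherever q[i] == 1."""
    n = len(x)
    idx = [i for i in range(n) if q[i] == 1]
    k = len(idx)
    A = [[cmath.exp(2j * cmath.pi * i * m / n) / n for m in range(k)] for i in idx]
    return solve(A, [x[i] for i in idx])

def synthesize(X, n):
    """Inverse DFT using only the first len(X) coefficients; q is not needed."""
    k = len(X)
    return [sum(X[m] * cmath.exp(2j * cmath.pi * i * m / n) for m in range(k)) / n
            for i in range(n)]
```

In an actual system the k coefficients returned by `bandlimited_coeffs` would be quantized before transmission, and `synthesize` would be applied to the quantized values at the decoder.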

III. Notation, Problem Model, and Rate-Distortion Functions 

Vectors and sequences are denoted in bold (e.g., x) with the ith element denoted as x[i]. Random 
variables are denoted using the sans serif font (e.g., x) while random vectors and sequences are denoted 
with bold sans serif (e.g., x). 

We are primarily interested in two kinds of side information, which we call "signal side information" 
and "distortion side information". The former (denoted w) corresponds to information that is statistically 
related to the source but does not directly affect the distortion measure and the latter (denoted q) 
corresponds to information that is not directly related to the source but does directly affect the distortion 
measure. Formally, we capture this decomposition with the following definition: 

Definition 1 We define a source x, distortion side information q, signal side information w, and distortion
measure⁴ $d(x, \hat{x}; q)$ as an admissible side information decomposition if the following Markov chains are
satisfied:

$$q \leftrightarrow w \leftrightarrow x \tag{2a}$$
$$d(x, \hat{x}; q) \leftrightarrow (x, \hat{x}, q) \leftrightarrow w. \tag{2b}$$

³The Hadamard transform as well as any permuted version of the DFT basis will also have these properties so the choice of
transform is not unique. Simply choosing any unitary transform is not sufficient, however.

The simplest case where these Markov chains are satisfied is when q and w are independent and 
therefore q and x are statistically independent (without conditioning on w). While many of our results 
apply to the general setting, we often specialize our results to this case. 

Decomposing side information into distortion related and signal related components proves useful since 
it allows us to isolate two important insights in source coding with side information. First, as Wyner 
and Ziv discovered [4], knowing w only at the decoder is often sufficient. Second, as our examples 
in Section II illustrate, knowing q only at the encoder is often sufficient. Furthermore, the relationship
between the side information and the distortion measure and the relationship between the side information 
and the source often arise from physically different effects and so such a decomposition is warranted 
from a practical standpoint. Of course, such a decomposition is not always possible and we explore 
some issues for general side information z, which affects both the source and distortion measure, in 
Sections |^ and IVII-CI 

In any case, we define the source coding with side information problem as the tuple 

$$(\mathcal{X}, \hat{\mathcal{X}}, \mathcal{Q}, \mathcal{W}, p_x(x), p_{w|x}(w|x), p_{q|w}(q|w), d(x, \hat{x}; q)). \tag{3}$$

Specifically, a source sequence x consists of the n samples x [1], x [2], . . ., x [n] drawn from the alphabet 
X. The signal side information w and the distortion side information q likewise consist of n samples 
drawn from the alphabets W and Q respectively. These random variables are generated according to the 
distribution 

$$p_{x,q,w}(x, q, w) = \prod_{i=1}^{n} p_x(x[i]) \cdot p_{w|x}(w[i] \,|\, x[i]) \cdot p_{q|w}(q[i] \,|\, w[i]). \tag{4}$$

Note that the distortion measure and joint distribution in (4) satisfy the admissibility condition of
Definition 1 by construction.

⁴We find it notationally convenient to consider the distortion measure as a random variable. This allows us to state the
desired conditional independence relationship as a Markov chain, but in the paper we only consider standard distortion measures
that are deterministic functions. This incurs no loss of generality since randomized distortion measures can always be replaced
with expected values without changing our performance measures.
Fig. 3. Possible scenarios for source coding with distortion dependent and signal dependent side information q and w. The
labels a, b, c, and d are 0/1 if the corresponding switch is open/closed and the side information is unavailable/available to the
encoder f(·) or decoder g(·).



Fig. 3 illustrates the sixteen possible scenarios where q and w may each be available at the encoder,
decoder, both, or neither depending on whether the four switches are open or closed. A rate R encoder f(·)
maps a source as well as possible side information to an index $i \in \{1, 2, \ldots, 2^{nR}\}$. The corresponding
decoder g(·) maps the resulting index as well as possible decoder side information to a reconstruction
of the source. Distortion for a source x, which is quantized and reconstructed to the sequence $\hat{x}$ taking
values in the alphabet $\hat{\mathcal{X}}$, is measured via

$$d(x, \hat{x}; q) = \frac{1}{n} \sum_{i=1}^{n} d(x[i], \hat{x}[i]; q[i]). \tag{5}$$

As usual, the rate-distortion function is the minimum rate such that there exists a system where the
distortion is less than D with probability approaching 1 as $n \to \infty$. We denote the sixteen possible
rate-distortion functions by describing where the side information is available. For example, $R_{[\text{Q-NONE-W-NONE}]}(D)$
denotes the rate-distortion function without side information and $R_{[\text{Q-NONE-W-DEC}]}(D)$ denotes the
Wyner-Ziv rate-distortion function where w is available at the decoder [4]. Similarly, when all information is
available at both encoder and decoder, $R_{[\text{Q-BOTH-W-BOTH}]}(D)$ describes Csiszár and Körner's [7]
generalization of Gray's conditional rate-distortion function $R_{[\text{Q-NONE-W-BOTH}]}(D)$ [12] to the case where the
side information can affect the distortion measure.

As pointed out by Berger [13], all the rate-distortion functions may be derived by considering q as part 
of x or w (i.e., by considering the "super-source" x' = (x, q) or the "super-side-information" w' = (w, q))
and applying well-known results for source coding, source coding with side information, the conditional 
rate-distortion theorem, etc. For example, if we set the signal side information w to null to simplify 
notation, the relevant rate-distortion functions are easy to obtain. 
Proposition 1 The rate-distortion functions when w is null (and hence (4) implies x and q are
independent) are

$$R_{[\text{Q-NONE}]}(D) = \inf_{p_{\hat{x}|x}(\hat{x}|x) \,:\, E[d(x, \hat{x}; q)] \leq D} I(x; \hat{x}) \tag{6a}$$

$$R_{[\text{Q-DEC}]}(D) = \inf_{p_{u|x}(u|x),\, v(\cdot,\cdot) \,:\, E[d(x, v(u, q); q)] \leq D} I(x; u) - I(u; q) \tag{6b}$$

$$R_{[\text{Q-ENC}]}(D) = \inf_{p_{\hat{x}|x,q}(\hat{x}|x,q) \,:\, E[d(x, \hat{x}; q)] \leq D} I(x, q; \hat{x}) = \inf_{p_{\hat{x}|x,q}(\hat{x}|x,q) \,:\, E[d(x, \hat{x}; q)] \leq D} I(x; \hat{x} | q) + I(\hat{x}; q) \tag{6c}$$

$$R_{[\text{Q-BOTH}]}(D) = \inf_{p_{\hat{x}|x,q}(\hat{x}|x,q) \,:\, E[d(x, \hat{x}; q)] \leq D} I(x; \hat{x} | q). \tag{6d}$$

The rate-distortion functions in (6a), (6b), and (6d) follow from standard results (e.g., [11], [6], [7], [12],
[4]). To obtain (6c) we can apply the classical rate-distortion theorem to the "super source" x' = (x, q).

IV. When Encoder-only Knowledge is Optimal 

In this section, we consider the simplified case when the signal side information w is null (and hence 
Definition 1 implies x and q are independent) and derive conditions for when encoder-only knowledge
of the distortion side information q is optimal. Comparing the rate-distortion functions in Proposition 1
immediately yields the following rate-loss result. 

Proposition 2 Knowing q only at the encoder is as good as knowing it at both encoder and decoder if
and only if there exists an $\hat{x}$ that optimizes (6d) with the property that $I(\hat{x}; q) = 0$. In such a case, the
resulting $\hat{x}$ is also optimal for (6c) and therefore (6c) and (6d) are equal.

The intuition for the "only if" part of Proposition 2 is illustrated in Fig. 4. Specifically, $p_{\hat{x}|q}(\hat{x}|q)$
represents the distribution of the codebook. Thus if a different codebook is required for different values
of q, then the penalty for knowing q only at the encoder is exactly the information that the encoder
must send to the decoder to allow the decoder to determine the proper codebook, i.e., $I(\hat{x}; q)$. The only
way that knowing q at the encoder can be just as good as knowing it at both is if there exists a fixed
codebook that is universally optimal regardless of q.

One of the main insights of this paper is the intuition for the "if" part of Proposition 2: if a variable
partition is allowed, universally good fixed codebooks exist as illustrated in Fig. 5. Specifically, $p_{\hat{x}|q}(\hat{x}|q)$
represents the distribution of the codebook while $p_{\hat{x}|x,q}(\hat{x}|x, q)$ represents the quantizer partition mapping
source and side information to a codeword. Thus even if the side information affects the distortion and
$p_{\hat{x}|x,q}(\hat{x}|x, q)$ depends on q, it may be that $p_{\hat{x}|q}(\hat{x}|q)$ is independent of q. In such cases (characterized by
the condition $I(\hat{x}; q) = 0$), there exists a fixed codebook with a variable partition, which is simultaneously
optimal for all values of the distortion side information q. Specifically, in such a system the reconstruction
$\hat{x}(i)$ corresponding to a particular index i is fixed regardless of q, but the partition mapping x to the
codebook index i depends on q.

Fig. 4. An example of two different quantizer codebooks for different values of the distortion side information q. If q indicates
the horizontal error (respectively, vertical error) is more important, then the encoder can use the codebook on the left (resp.,
right) to increase horizontal accuracy (resp., vertical accuracy). The penalty for knowing q only at the encoder is the amount
of bits required to communicate which codebook was used.

In various scenarios, this type of fixed codebook variable partition approach can be implemented via a
lattice [14] as illustrated in Fig. 6. As discussed in Section II, fixed codebook variable partition systems
can also be constructed from transforms. Specifically, in Section II-B the fixed codebook is generated by
quantizing band-limited signals. The resulting variable partition has cells similar to Fig. 5 but with the
width of the cells in the unimportant coordinates being infinite.

The quantization error will depend on the source distribution and size of the quantization cells. Thus
if the quantization cells of a fixed codebook variable partition system like Fig. 5 are the same as the
corresponding variable codebook system in Fig. 4, the performance will be the same. Intuitively, the main
difference between these two figures (as well as general fixed codebook/variable partition schemes versus
variable codebook schemes) is in the nature of the tiling of Fig. 5 versus Fig. 4. In Sections IV-A and
IV-C we consider various scenarios where the uniformity in the source or distortion measure
makes these two tilings equivalent and thus the condition $I(\hat{x}; q) = 0$ is satisfied for all distortions.
Similarly, in Section V we show that if we ignore "edge effects" and focus on the high-resolution regime,
then the difference in these tilings becomes negligible. Thus in the high-resolution regime we show that
a wide range of source and distortion models admit variable partition, fixed codebook quantizers as in
Fig. 5 and achieve $I(\hat{x}; q) = 0$.

Fig. 5. An example of a quantizer with a variable partition and fixed codebook. If the encoder knows the horizontal error
(respectively, vertical error) is more important, it can use the partition on the left (resp., right) to increase horizontal accuracy
(resp., vertical accuracy). The decoder only needs to know the quantization point, not the partition.

Fig. 6. A fixed, hexagonal, lattice codebook with four different partitions.

Finally, we note that Proposition 2 suggests that distortion side information q complements signal side
information w in the sense that q is useful at the encoder while w is useful at the decoder. In fact, in
Section V-B we strengthen this complementary relationship by showing that w is often useless at the
encoder while q is often useless at the decoder.

A. Uniform Sources with Group Difference Distortions 

Let the source x be uniformly distributed over a group $\mathcal{X}$ with the binary operation $\oplus$. For convenience,
we use the symbol $a \ominus b$ to denote $a \oplus b^{-1}$ (where $b^{-1}$ denotes the additive inverse of b in the group).
We define a group difference distortion measure as any distortion measure where

$$d(x, \hat{x}; q) = \rho(\hat{x} \ominus x; q) \tag{7}$$

for some function $\rho(\cdot\,; \cdot)$. As we will show, the symmetry in this scenario ensures that the optimal codebook
distribution is uniform. This allows an encoder to design a fixed codebook and vary the quantization
partition based on q to achieve the same performance as a system where both encoder and decoder
know q. This uniformity of the codebook, made precise in the following theorem, provides a general
information-theoretic explanation for the Reed-Solomon example in Section II-A.

Theorem 1 Let the source x be uniformly distributed over a group with a difference distortion measure
and let the distortion side information q be independent of the source. Then, to optimize (6c), it suffices
to choose the test-channel distribution $p_{\hat{x}|x,q}(\hat{x}|x, q)$ such that $p_{\hat{x}|q}(\hat{x}|q)$ is uniform for each q (and hence
independent of q), implying that there is no rate-loss as stated in Proposition 2, i.e.,

$$R_{[\text{Q-ENC}]}(D) = R_{[\text{Q-BOTH}]}(D). \tag{8}$$

For either finite or continuous groups this theorem can be proved by deriving the conditional Shannon
Lower Bound (which holds for any source) and showing that this bound is tight for uniform sources.
We use this approach below to give some intuition. For more general "mixed groups" with both discrete
and continuous components, entropy is not well defined and a more complicated argument based on
symmetry and convexity is provided in Appendix A.

Lemma 1 (Conditional Shannon Lower Bound) Let the source x be distributed over a discrete group,
$\mathcal{X}$, with a difference distortion measure, $\rho(\hat{x} \ominus x; q)$. Then if we define $z^*$ as the random variable that
maximizes $H(z|q)$ subject to the constraint $E[\rho(z; q)] \leq D$, then

$$R_{[\text{Q-ENC}]}(D) \geq \log|\mathcal{X}| - H(z^*|q). \tag{9}$$

For continuous groups, $|\mathcal{X}|$ and $H(z^*|q)$ can be replaced by the Lebesgue measure of the group and
$h(z^*|q)$ in (9) (as well as the following proof).
Proof:

$$I(\hat{x}; x, q) \geq \log|\mathcal{X}| + H(q) - H(x, q|\hat{x}) \tag{10}$$

$$= \log|\mathcal{X}| + H(q) - H(q|\hat{x}) - H(\hat{x} \ominus x \,|\, \hat{x}, q) \tag{11}$$

$$\geq \log|\mathcal{X}| - H(\hat{x} \ominus x \,|\, q) \tag{12}$$

$$\geq \log|\mathcal{X}| - H(z^*|q) \tag{13}$$

where (10) follows since a uniform source independent of q maximizes entropy, (12) follows since
conditioning reduces entropy, and (13) follows from the definition of $z^*$ since $E[\rho(\hat{x} \ominus x; q)] \leq D$. ■

Proof of Theorem 1: Choosing the test-channel distribution $\hat{x} = z^* \oplus x$ with the pair $(z^*, q)$
independent of x achieves the bound in (9) with equality and must therefore be optimal. Furthermore,
since x is uniform, so is $\hat{x}$ and therefore $\hat{x}$ and q are statistically independent. Therefore $I(\hat{x}; q) = 0$ and
thus comparing (6c) to (6d) shows $R_{[\text{Q-ENC}]}(D) = R_{[\text{Q-BOTH}]}(D)$ for finite groups. The same argument
holds for continuous groups with entropy replaced by differential entropy and $|\mathcal{X}|$ replaced by Lebesgue
measure. ■
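The key step in the proof of Theorem 1, that the reconstruction stays uniform no matter how the quantization noise depends on q, can be checked numerically. The sketch below assumes the cyclic group of integers modulo m as a concrete instance; the function name and the example noise distributions are illustrative assumptions.

```python
def reconstruction_pmf(noise_pmf, m):
    """Distribution of x_hat = (z + x) mod m when x is uniform on the cyclic
    group Z_m and the noise z (with pmf noise_pmf) is independent of x."""
    out = [0.0] * m
    for z, pz in enumerate(noise_pmf):
        for x in range(m):
            out[(z + x) % m] += pz / m
    return out

if __name__ == "__main__":
    m = 4
    # Hypothetical noise distributions for two values of q: a sample that must be
    # reproduced exactly, and a sample where errors are tolerated.
    noise_given_q = {"important": [1.0, 0.0, 0.0, 0.0],
                     "unimportant": [0.25, 0.25, 0.25, 0.25]}
    for label, pz in noise_given_q.items():
        print(label, reconstruction_pmf(pz, m))
```

Both cases give the same uniform distribution for the reconstruction, so a single codebook serves every value of q and only the partition changes.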



Uniform source and group difference distortion measures arise naturally in a variety of applications. 
One example is phase quantization where applications such as Magnetic Resonance Imaging, Synthetic 
Aperture Radar, and Ultrasonic Microscopy infer physical phenomena from the phase shifts induced 
in a probe signal [15], [16], [17]. Alternatively, when magnitude and phase information must both be 
recorded, there are sometimes advantages to treating these separately, [18], [19], [20], [21]. The key 
special case when only two phases are recorded corresponds to Hamming distortion and we use this 
scenario to illustrate how distortion side information affects quantization. 

For a symmetric binary source, we first derive the various rate-distortion trade-offs for a general 
Hamming distortion measure depending on q. Next we adapt this general result to the special cases of 
quantizing noisy observations and quantizing with a weighted distortion measure. The former demonstrates 
that the naive encoding method where the encoder losslessly communicates the side information to the 
decoder and then uses optimal encoding, can require arbitrarily higher rate than the optimal rate-distortion 
trade-off. The second example demonstrates that ignoring the side information can result in arbitrarily 
higher distortion than the minimum required by optimal schemes. 



B. Examples 
1) General Formula for Hamming Distortion Depending on q: Consider a symmetric binary source x
with side information q taking values in {1, 2, ..., N} with distribution $p_q(q)$. Let distortion be measured
according to

$$d(x, \hat{x}; q) = \alpha_q + \beta_q \cdot d_H(x, \hat{x}) \tag{14}$$

where $\{\alpha_1, \alpha_2, \ldots, \alpha_N\}$ and $\{\beta_1, \beta_2, \ldots, \beta_N\}$ are sets of non-negative weights. In Appendix B we derive
the various rate-distortion functions for $D > E[\alpha_q]$ to obtain

$$R_{[\text{Q-NONE}]}(D) = R_{[\text{Q-DEC}]}(D) = 1 - H_b\!\left(\frac{D - E[\alpha_q]}{E[\beta_q]}\right) \tag{15a}$$

$$R_{[\text{Q-ENC}]}(D) = R_{[\text{Q-BOTH}]}(D) = 1 - \sum_{i=1}^{N} p_q(i) \cdot H_b\!\left(\frac{2^{-\lambda \beta_i}}{1 + 2^{-\lambda \beta_i}}\right) \tag{15b}$$

where $\lambda$ is chosen to satisfy the distortion constraint

$$\sum_{i=1}^{N} p_q(i) \left[\alpha_i + \beta_i \cdot \frac{2^{-\lambda \beta_i}}{1 + 2^{-\lambda \beta_i}}\right] = D. \tag{16}$$
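As a numerical illustration of (15) and (16), the sketch below evaluates both rate expressions, finding the multiplier in (16) by bisection. It assumes the per-state Hamming distortion takes the form 2^(−λβ_i)/(1 + 2^(−λβ_i)), that D exceeds E[α_q], and that E[β_q] is positive; the function names and the bisection bounds are choices made for this sketch.

```python
from math import log2

def hb(p):
    """Binary entropy in bits, with hb(0) = hb(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def flip_prob(lam, beta):
    """Per-state Hamming distortion 2^(-lam*beta) / (1 + 2^(-lam*beta))."""
    return 1.0 / (1.0 + 2.0 ** (lam * beta))

def rate_q_none(D, pq, alpha, beta):
    """Rate without encoder side information: 1 - H_b((D - E[alpha]) / E[beta])."""
    e_alpha = sum(p * a for p, a in zip(pq, alpha))
    e_beta = sum(p * b for p, b in zip(pq, beta))
    p = min(max((D - e_alpha) / e_beta, 0.0), 0.5)
    return 1.0 - hb(p)

def rate_q_enc(D, pq, alpha, beta, iters=200):
    """Rate with q at the encoder: bisect on lam until the average distortion is D."""
    def distortion(lam):
        return sum(p * (a + b * flip_prob(lam, b)) for p, a, b in zip(pq, alpha, beta))
    if distortion(0.0) <= D:
        return 0.0  # the constraint is met without sending anything
    lo, hi = 0.0, 64.0 / max(beta)  # distortion(lam) decreases as lam grows
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if distortion(mid) > D:
            lo = mid
        else:
            hi = mid
    return 1.0 - sum(p * hb(flip_prob(hi, b)) for p, b in zip(pq, beta))
```

For example, with two equally likely states, weights α = (0, 1/2) and β = (1, 0), and D = 0.3, `rate_q_enc` comes out at half of `rate_q_none`, because the encoder spends no bits on the samples whose errors do not affect the distortion.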

2) Noisy Observations: To provide a more concrete illustration of the effect of side information at 
the encoder, consider the special case where x is a noisy observation of an underlying source received 
through a binary symmetric channel (BSC) with crossover probability specified by the side information. 
Specifically, let the crossover probability of the BSC be 

$$\epsilon_q = \frac{q - 1}{2(N - 1)},$$

which is always at most 1/2. 

A distortion of 1 is incurred if a crossover occurs due to either the noise in the observation or the 
noise in the quantization (but not both): 

$$d(x, \hat{x}; q) = \epsilon_q \cdot [1 - d_H(x, \hat{x})] + (1 - \epsilon_q) \cdot d_H(x, \hat{x}) = \epsilon_q + (1 - 2\epsilon_q) \cdot d_H(x, \hat{x}).$$

This corresponds to a distortion measure in the form of (14) with $\alpha_q = (q - 1)/[2(N - 1)]$ and 
$\beta_q = 1 - (q - 1)/(N - 1)$. Hence, the rate-distortion formulas from (15) apply. The optimal encoding 
strategy is to encode the noisy observation directly as discussed in [22], although with different amounts 
of quantization depending on the side information. 
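
As a quick check of the mapping between the noisy-observation model and the general form (14), the sketch below tabulates the crossover probability and the resulting weights for each value of q and confirms that the two expressions for the distortion agree. The helper name `noisy_observation_weights` is hypothetical and only serves this illustration.

```python
def noisy_observation_weights(N):
    """Return (eps_q, alpha_q, beta_q) for q = 1, ..., N in the noisy-observation example."""
    weights = []
    for q in range(1, N + 1):
        eps = (q - 1) / (2 * (N - 1))   # BSC crossover probability for state q
        alpha = eps                     # constant term of the distortion
        beta = 1 - 2 * eps              # weight on the Hamming distortion
        weights.append((eps, alpha, beta))
    return weights

def observation_distortion(eps, d_h):
    """Distortion written as eps * (1 - d_H) + (1 - eps) * d_H."""
    return eps * (1 - d_h) + (1 - eps) * d_h
```

For N = 2 this gives the pair of states used in the left plot: a noiseless observation ($\epsilon = 0$, $\beta = 1$) and a completely noisy one ($\epsilon = 1/2$, $\beta = 0$).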

In the left plot of Fig. 7 we illustrate the rate-distortion function for this problem with and without 
side information at the encoder in the case where $q \in \{1, 2\}$ (i.e., N = 2) and the observation is 
either noiseless or completely noisy. In the right plot of Fig. 7 we illustrate the rate-distortion 
functions in the limit where $N \to \infty$ and the noise is uniformly distributed between 0 and 1/2. Note 
that in the latter case, if the side information were encoded losslessly then $\log N$ extra bits of 
overhead would be required beyond the amount when optimal encoding is used. Hence, communicating the side 
information losslessly can require an arbitrarily large rate even though optimal use of the encoder side 
information reduces the quantization rate. 




[Figure 7: two panels (left and right); horizontal axis: Rate.]

Fig. 7. Rate-distortion curves for noisy observations of a binary source. The solid curve represents the 
minimum possible Hamming distortion when side information specifying the crossover probability of the 
observation noise is available at the encoder (or both at the encoder and decoder). The dashed curve 
represents the minimum distortion when side information is not available (or ignored) at the encoder. For 
the plot on the left the crossover probability for the observation noise is equally likely to be 0 or 1/2, 
while for the plot on the right it is uniformly distributed over the interval [0, 1/2]. 

3) Weighted Distortion: In the previous example, certain source samples were more important because 
they were observed with less noise. In this section we consider a model where certain samples of a source 
are inherently more important than others (e.g., edges in a binary image, other perceptually important 
features, or sensor readings in high activity areas). Specifically, we consider a distortion measure of 
the form (14) where $\beta_k = \exp(\gamma k / N)$, $\alpha_k = 0$, and the side information is uniformly 
distributed over $\{0, 1, \ldots, N - 1\}$. 

The left plot in Fig. 8 illustrates the rate-distortion curves for the case when N = 2, while the right 
plot corresponds to the case where $N \to \infty$. The former model corresponds to two types of samples: 
important samples where a distortion of $\exp(\gamma/2)$ is incurred if quantized incorrectly and normal 
samples where a distortion of 1 is incurred if quantized incorrectly. If no side information is available 
(or if side information is ignored), the encoder must treat these samples equally. If side information at 
the encoder is used optimally, then more bits are spent quantizing the important samples. The latter model 
illustrates the case when there is a continuum of importance for the samples. 




[Figure 8: two panels (left and right); horizontal axis: Rate.]

Fig. 8. Rate-distortion curves for a binary source with weighted Hamming distortion. The distortion for 
quantizing each source sample is measured via Hamming distortion times the weight $\exp(5 \cdot q)$. The 
solid curve represents the minimum possible Hamming distortion when side information specifying the weight 
is available at the encoder (or both at the encoder and decoder). The dashed curve represents the minimum 
distortion when no side information is available at the encoder. In the left plot, q is uniformly 
distributed over the pair {0, 1} while in the right plot q is uniformly distributed over the interval [0, 1]. 

In the limit as $\gamma \to \infty$ with N = 2, the system not using side information suffers increasingly 
more distortion. This is most evident for rates greater than 1/2. In this rate region, the system with side 
information losslessly encodes the important samples and the distortion is bounded by 0.5, while the 
system without side information has a distortion that scales with $\exp(\gamma/2)$. Thus the extra 
distortion incurred when q is not available to the encoder can be arbitrarily large. 
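
The growth of this distortion gap can be illustrated numerically for N = 2. The sketch below is only an illustration under stated assumptions: it inverts the no-side-information formula (15a) with $\alpha = 0$ and weights $\{1, \exp(\gamma/2)\}$, each with probability 1/2, and compares it at rate 1/2 with a simple informed operating point (important samples reproduced exactly, normal samples reproduced with crossover probability 1/2). That operating point is a hypothetical reference scheme chosen here for simplicity, not necessarily the optimal informed scheme.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def hb_inverse(h, iters=100):
    """Return p in [0, 1/2] with hb(p) = h, found by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if hb(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def distortion_without_q(rate, gamma):
    """Invert Eq. (15a) for alpha = 0 and weights {1, exp(gamma/2)}, each w.p. 1/2."""
    mean_weight = 0.5 * (1.0 + math.exp(gamma / 2.0))
    return mean_weight * hb_inverse(1.0 - rate)

def distortion_informed_upper_bound():
    """Rate-1/2 reference point: important samples exact, normal samples guessed.

    Normal samples have probability 1/2, weight 1, and crossover probability 1/2.
    """
    return 0.5 * 1.0 * 0.5
```

Evaluating `distortion_without_q(0.5, gamma)` for increasing `gamma` shows the exponential growth described above, while the informed reference point stays fixed.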

C. Erasure Distortions 

Another scenario where the condition $I(\hat{x}; q) = 0$ is satisfied is for "erasure distortions" where q 
is binary and the distortion measure is of the form 

$$d(x, \hat{x}; q) = \rho(x, \hat{x}) \cdot q \tag{18}$$

for some function $\rho(\cdot, \cdot)$. 

Theorem 2 For any source distribution, if the distortion measure is of the form in (18) with binary q, 
then the rate-distortion function when q is known at only the encoder is the same as when q is known 
at both encoder and decoder, i.e., 

$$R_{\text{[Q-ENC]}}(D) = R_{\text{[Q-BOTH]}}(D). \tag{19}$$

Proof: Let $\hat{x}^*$ be a distribution that optimizes (6d). Choose the new random variable 
$\hat{x}^{**}$ to be the same as $\hat{x}^*$ when q = 1 and, when q = 0, let $\hat{x}^{**}$ be independent 
of x with the same marginal distribution as when q = 1: 

$$p_{\hat{x}^{**}|x,q}(\hat{x}|x,q) = \begin{cases} p_{\hat{x}^{*}|x,q}(\hat{x}|x,q), & q = 1 \\ p_{\hat{x}^{*}|q}(\hat{x}|q = 1), & q = 0. \end{cases} \tag{20}$$

Both $\hat{x}^*$ and $\hat{x}^{**}$ have the same expected distortion since they only differ when q = 0, 
where the distortion in (18) is zero regardless of the reconstruction. Furthermore, by the data processing 
inequality 

$$I(\hat{x}^{**}; x | q) \le I(\hat{x}^{*}; x | q) \tag{21}$$

so $\hat{x}^{**}$ also optimizes (6d). Finally, since $I(\hat{x}^{**}; q) = 0$, Proposition 2 is satisfied 
and we obtain the desired result. ∎ 

As shown in the examples of Section III, erasure distortions may admit particularly simple quantizers 
to optimally use encoder side information. 

V. Asymptotically Zero Loss for High-Resolution Quantization 

In Proposition 2 of Section IV we saw that knowing q only at the encoder is as good as full knowledge 
of q when the codebook is independent of the distortion side information (i.e., when $I(\hat{x}; q) = 0$). 
In the high-resolution regime,^5 $\hat{x}$ begins to closely approximate x. Thus when x and q are 
independent we may intuitively expect that as $\hat{x} \to x$ we have $I(\hat{x}; q) \to I(x; q) = 0$. In 
this section, we rigorously justify this intuition and also consider some variations. 

^5 Usually, the high-resolution limit is defined as when $D \to 0$, but it is sometimes useful to consider 
distortion measures with a constant penalty. Hence we assume it is possible to define a minimum distortion 
$D_{\min}$ which can be approached arbitrarily closely as the rate increases and we define the 
high-resolution limit as $D \to D_{\min}$. 

For example, we consider the more general case where the signal side information w, which is statistically 
dependent on the source, is present. Our problem model (and specifically its Markov conditions) requires q 
to be conditionally independent of x given w and requires the distortion to be conditionally independent 
of w given q, x, and $\hat{x}$. But since our model allows for q and w to be statistically dependent, 
q may now be indirectly correlated with x (through w) and w may indirectly affect the distortion (through 
q). Even in this more general scenario, the basic intuition that distortion side information is necessary 
only at the encoder and signal side information is necessary only at the decoder continues to hold in 
many cases of interest. 

A. Technical Conditions 

In addition to the side information decomposition implied by Definition 1, our results require a 
"continuity of entropy" property that essentially states 

$$z \to 0 \;\Rightarrow\; h(x + z \mid q, w) \to h(x \mid q, w). \tag{22}$$

The desired continuity follows from [23] provided the source, distortion measure, and side information 
satisfy some technical conditions related to smoothness. These conditions are not particularly hard to 
satisfy; for example, any vector source, side information, and distortion measure where 

$$\exists\, \delta > 0, \quad -\infty < E\big[\|x\|^{\delta} \mid w = w\big] < \infty \quad \forall w \tag{23a}$$

$$-\infty < h(x \mid w = w) < \infty, \quad \forall w \tag{23b}$$

$$d(x, \hat{x}; q) = \alpha(q) + \beta(q) \cdot \|x - \hat{x}\|^{\gamma(q)} \tag{23c}$$

will satisfy the desired technical conditions in [23] provided $\alpha(\cdot)$, $\beta(\cdot)$, and 
$\gamma(\cdot)$ are non-negative functions. For more general scenarios we introduce the following 
definition to summarize the requirements from [23]. 

Definition 2 We define a source x, side information pair (q, w), and difference distortion measure 
$d(x, \hat{x}; q) = \rho(x - \hat{x}; q)$ as admissible if the following conditions are satisfied: 

1) the equations 

$$a(D, q) \int \exp[-s(D, q)\, \rho(x; q)]\, dx = 1 \tag{24a}$$

$$a(D, q) \int \rho(x; q) \exp[-s(D, q)\, \rho(x; q)]\, dx = D \tag{24b}$$

have a unique pair of solutions $(a(D, q), s(D, q))$ for all $D > D_{\min}$ that are continuous functions 
of their arguments 

2) $-\infty < h(x \mid w = w) < \infty$, for all w 

3) For each value of q, there exists an auxiliary distortion measure $\delta(\cdot; q)$ where the equations 

$$a_{\delta}(D, q) \int \exp[-s_{\delta}(D, q)\, \delta(x; q)]\, dx = 1 \tag{25a}$$

$$a_{\delta}(D, q) \int \delta(x; q) \exp[-s_{\delta}(D, q)\, \delta(x; q)]\, dx = D \tag{25b}$$

have a unique pair of solutions $(a_{\delta}(D, q), s_{\delta}(D, q))$ for all $D > D_{\min}$ that are 
continuous functions of their arguments 

4) The distribution $z_q$ that maximizes $h(z_q \mid q)$ subject to the constraint 
$E[\rho(z_q; q)] \le D$ has the property that 

$$\lim_{D \to D_{\min}} z_q \to 0 \text{ in distribution} \quad \forall q \tag{26a}$$

$$\lim_{D \to D_{\min}} E[\delta(x + z_q, q) \mid q = q] = E[\delta(x, q) \mid q = q] \quad \forall q. \tag{26b}$$

B. Equivalence Theorems 

Our main results for continuous sources consist of various theorems describing when different types of 
side information knowledge are equivalent. All our results require the basic side information 
decomposition and Markov chains in Definition 1. Some results require further conditions such as 
high-resolution, scaled difference distortion measures, or statistical independence between q and w (e.g., 
when w is known only at the decoder, knowing q only at the encoder is asymptotically as good as knowing it 
at both when q and w are independent). The equivalence theorems stated below are proved in Appendix C 
and summarized in Fig. 9. Essentially, we show that the sixteen possible information configurations can be 
reduced to the four shown in Fig. 9. Specifically, to determine the performance of a given side information 
configuration one can essentially ask two questions: "Does the encoder have q?" and "Does the decoder 
have w?" A negative answer to either question incurs some penalty relative to the case with all side 
information known everywhere. In contrast, a positive answer to both questions incurs asymptotically no 
rate-loss relative to complete side information. 

We begin by generalizing previous results about no rate-loss for the Wyner-Ziv problem in the 
high-resolution limit [8], [10] to show there is no rate-loss when q is known at the encoder and w is known 
at the decoder. This result suggests that there is a natural division of side information between the 
encoder and decoder (at least asymptotically). Specifically, distortion side information should be sent to 
the encoder and signal side information should be sent to the decoder. 

[Figure 9: a 2 x 2 grid of groups of rate-distortion functions joined by arrows. Columns: "Decoder missing 
w" and "Decoder has w". Rows: "Encoder missing q" and "Encoder has q". 
  Encoder missing q, decoder missing w: R[Q-DEC-W-ENC], R[Q-DEC-W-NONE], R[Q-NONE-W-ENC], 
    R[Q-NONE-W-NONE]; arrows labeled Th. 5 (S+I). 
  Encoder missing q, decoder has w: R[Q-DEC-W-BOTH], R[Q-NONE-W-BOTH], R[Q-DEC-W-DEC], 
    R[Q-NONE-W-DEC]; arrows labeled Th. 5 (S) and Th. 6 (H+S+I). 
  Encoder has q, decoder missing w: R[Q-ENC-W-ENC], R[Q-ENC-W-NONE], R[Q-BOTH-W-ENC], 
    R[Q-BOTH-W-NONE]; arrows labeled Th. 6 (H+I). 
  Encoder has q, decoder has w: R[Q-ENC-W-DEC], R[Q-ENC-W-BOTH], R[Q-BOTH-W-DEC], 
    R[Q-BOTH-W-BOTH]; arrows labeled Th. 6 (H).] 

Fig. 9. Summary of main results for continuous sources. Arrows indicate which theorems demonstrate equality 
between various rate-distortion functions and list the assumptions required (H = high-resolution, I = q 
and w independent, S = scaled difference distortion). 

Theorem 3 For any admissible source, side information, and difference distortion measure satisfying 
Definitions 1 and 2, q and w can be divided between the encoder and decoder with no asymptotic 
penalty, i.e., 

$$\lim_{D \to D_{\min}} R_{\text{[Q-ENC-W-DEC]}}(D) - R_{\text{[Q-BOTH-W-BOTH]}}(D) = 0. \tag{27}$$

The next two theorems generalize Berger's result that signal side information is useless when known 
only at the encoder [11] to show when w and q are useless at the encoder and decoder respectively. These 
results suggest that deviating from the natural division of Theorem 3 and providing side information in 
the wrong place makes that side information useless (at least in terms of the rate-distortion function). 

Theorem 4 Let q and w be independent^6 and consider a difference distortion measure of the form 
$d(x - \hat{x}; q)$. Then w provides no benefit when known only at the encoder, i.e., 

$$R_{\text{[Q-*-W-ENC]}}(D) = R_{\text{[Q-*-W-NONE]}}(D) \tag{28}$$

where the wildcard "*" may be replaced with an element from {ENC, DEC, BOTH, NONE} (both *'s must be 
replaced with the same element). 

^6 Independence is only required when $* \in \{\text{DEC}, \text{NONE}\}$; if $* \in \{\text{ENC}, \text{BOTH}\}$, 
the theorem holds without this condition provided the side information decomposition is admissible 
according to Definition 1. 

Theorem 5 Let q and w be independent^7 and consider a scaled distortion measure of the form 
$d(x, \hat{x}; q) = d_0(q)\, d_1(x, \hat{x})$. Then q provides no benefit when known only at the decoder, 
i.e., 

$$R_{\text{[Q-DEC-W-*]}}(D) = R_{\text{[Q-NONE-W-*]}}(D) \tag{29}$$

where the wildcard "*" may be replaced with an element from {ENC, DEC, BOTH, NONE} (both *'s must be 
replaced with the same element). 

Using the previous results, we can generalize Theorem 3 to show that regardless of where w (respectively 
q) is known, knowing q (w) only at the encoder (decoder) is sufficient in the high-resolution limit. 
This result essentially suggests that even if the ideal of providing q to the encoder and w to the decoder 
suggested by Theorem 3 is impossible, it is still useful to follow this ideal as much as possible. 

Theorem 6 Let q and w be independent.^8 For any source and scaled^9 difference distortion measure 
$d(x, \hat{x}; q) = d_0(q) \cdot d_1(x - \hat{x})$ satisfying the conditions in Definitions 1 and 2, q 
(respectively, w) is asymptotically only required at the encoder (respectively, at the decoder), i.e., 

$$\lim_{D \to D_{\min}} R_{\text{[Q-ENC-W-*]}}(D) - R_{\text{[Q-BOTH-W-*]}}(D) = 0 \tag{30a}$$

$$\lim_{D \to D_{\min}} R_{\text{[Q-*-W-DEC]}}(D) - R_{\text{[Q-*-W-BOTH]}}(D) = 0 \tag{30b}$$

where the wildcard "*" may be replaced with an element from {ENC, DEC, BOTH, NONE} (both *'s must be 
replaced with the same element). 

C. Penalty Theorems 

We can compute the asymptotic penalty for not knowing the signal side information w at the decoder. 



^7 Independence is only required when $* \in \{\text{ENC}, \text{NONE}\}$; if $* \in \{\text{DEC}, \text{BOTH}\}$, 
the theorem holds without this condition provided the side information decomposition is admissible 
according to Definition 1. 

^8 Independence is only required when $* \in \{\text{ENC}, \text{NONE}\}$ in (30a) or when 
$* \in \{\text{DEC}, \text{NONE}\}$ in (30b). For $* \in \{\text{DEC}, \text{BOTH}\}$ in (30a) or 
$* \in \{\text{ENC}, \text{BOTH}\}$ in (30b) the theorem holds without this condition provided the side 
information decomposition is admissible according to Definition 1. 

^9 The scaled form of the distortion measure is only required when $* \in \{\text{DEC}, \text{NONE}\}$ in 
(30b). When $* \in \{\text{ENC}, \text{BOTH}\}$, the theorem only requires a difference distortion measure 
of the form $d(x, \hat{x}; q) = d'(x - \hat{x}; q)$. 

Theorem 7 Let q and w be independent.^10 Then for any source and scaled difference distortion measure 
$d(x, \hat{x}; q) = d_0(q) \cdot d_1(x - \hat{x})$ satisfying the conditions in Definition 2, the penalty 
for not knowing w at the decoder is 

$$\lim_{D \to D_{\min}} R_{\text{[Q-*-W-\{ENC-OR-NONE\}]}}(D) - R_{\text{[Q-*-W-\{DEC-OR-BOTH\}]}}(D) = I(x; w) \tag{31}$$

where the wildcard "*" may be replaced with an element from {ENC, DEC, BOTH, NONE} (all *'s must be 
replaced with the same element). 

Theorem 7 also gives us insight about distortion side information that is not independent of the source. 
Specifically, imagine that the side information z affects the distortion via 
$d(x, \hat{x}; z) = d_0(z) \cdot d_1(x - \hat{x})$ and furthermore, z is correlated with the source. What 
is the penalty for knowing z only at the encoder versus at both encoder and decoder? To answer this 
question, we can decompose z into our framework by setting q = z and w = z and computing 

$$\lim_{D \to D_{\min}} R_{\text{[Q-ENC-W-ENC]}}(D) - R_{\text{[Q-BOTH-W-BOTH]}}(D). \tag{32}$$

Since this pair of q and w are statistically dependent, we take into account the footnote to Theorem 7. 
Therefore we see that the asymptotic penalty for knowing general side information only at the encoder is 
exactly the degree to which the source and distortion side information are related as measured by mutual 
information: 

Corollary 1 For any source and scaled difference distortion measure 
$d(x, \hat{x}; z) = d_0(z) \cdot d_1(x - \hat{x})$ satisfying the conditions in Definition 2, the penalty 
for knowing general side information z only at the encoder is 

$$\lim_{D \to D_{\min}} R_{\text{[Z-ENC]}}(D) - R_{\text{[Z-BOTH]}}(D) = I(x; z). \tag{33}$$

Finally, we can compute the asymptotic penalty for not knowing the distortion side information q at 
the encoder. 

Theorem 8 Let q and w be independent.^11 For any source taking values in the k-dimensional real vector 
space and a scaled, norm-based distortion measure $d(x, \hat{x}; q) = q \cdot \|x - \hat{x}\|^r$ 
satisfying the conditions in Definition 2, the penalty (in nats/sample) for not knowing q at the encoder is 

$$\lim_{D \to D_{\min}} R_{\text{[Q-\{DEC-OR-NONE\}-W-*]}}(D) - R_{\text{[Q-\{ENC-OR-BOTH\}-W-*]}}(D) = \frac{k}{r}\, E\!\left[\ln \frac{E[q]}{q}\right] \tag{34}$$

where the wildcard "*" may be replaced with an element from {ENC, DEC, BOTH, NONE} (both *'s must be 
replaced with the same element). 

^10 Independence is only required when $* \in \{\text{DEC}, \text{NONE}\}$; if $* \in \{\text{ENC}, \text{BOTH}\}$, 
the theorem holds without this condition provided the side information decomposition is admissible 
according to Definition 1. 

^11 Independence is only required when $* \in \{\text{ENC}, \text{NONE}\}$; if $* \in \{\text{DEC}, \text{BOTH}\}$, 
the theorem holds without this condition provided the side information decomposition is admissible 
according to Definition 1. 

A similar result, which essentially compares the asymptotic difference between 

$$R_{\text{[Q-DEC-W-DEC]}}(D) \quad \text{and} \quad R_{\text{[Q-BOTH-W-BOTH]}}(D)$$

for non-independent q and w with squared norm distortion, is derived in [10]. Thus as with Corollary 1, 
[10] and Theorem 8 can be interpreted as saying that the asymptotic penalty for knowing general side 
information z only at the decoder can be quantified by the degree to which the distortion and the side 
information are related as measured by (34). Specifically, setting q = z and w = z yields distortion side 
information and signal side information that are statistically dependent. Thus from [10] or by applying 
the footnote to Theorem 8 we obtain the following Corollary: 

Corollary 2 For any source taking values in the k-dimensional real vector space and a scaled, norm-based 
distortion measure $d(x, \hat{x}; z) = z \cdot \|x - \hat{x}\|^r$ satisfying the conditions in 
Definition 2, the penalty (in nats/sample) for not knowing z at the encoder is 

$$\lim_{D \to D_{\min}} R_{\text{[Z-DEC]}}(D) - R_{\text{[Z-BOTH]}}(D) = \frac{k}{r}\, E\!\left[\ln \frac{E[z]}{z}\right]. \tag{35}$$



According to Jensen's inequality, this rate gap is always greater than or equal to zero with equality if 
and only if the distortion side information is a constant with probability 1. Furthermore, the rate gap is 
scale invariant in the sense that it does not change when the distortion side information is multiplied by 
any positive constant. 
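
The two properties just noted (non-negativity and scale invariance) are easy to check numerically for a discrete q. The sketch below assumes the rate gap has the form $(k/r)\,(\ln E[q] - E[\ln q])$, which is consistent with the value $0.5 \cdot (\ln 4.6 - 0.4 \ln 10)$ quoted for the Gaussian example in Section VI; the function name `rate_gap_nats` and the two-point distributions are illustrative.

```python
import math

def rate_gap_nats(values, probs, k=1, r=2):
    """High-resolution rate gap (k/r) * (ln E[q] - E[ln q]) for a discrete, positive q."""
    mean_q = sum(p * v for v, p in zip(values, probs))
    mean_log_q = sum(p * math.log(v) for v, p in zip(values, probs))
    return (k / r) * (math.log(mean_q) - mean_log_q)

# q = 1 with probability 0.6 and q = 10 with probability 0.4 (quadratic distortion, k = 1)
gap = rate_gap_nats([1.0, 10.0], [0.6, 0.4])

# Multiplying q by a positive constant leaves the gap unchanged
gap_scaled = rate_gap_nats([3.0, 30.0], [0.6, 0.4])

# A constant q gives zero gap (the equality case of Jensen's inequality)
gap_constant = rate_gap_nats([2.0, 2.0], [0.5, 0.5])
```

Here `gap` and `gap_scaled` agree up to floating-point rounding, and `gap_constant` is zero up to rounding.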

In Table I we evaluate the high-resolution rate penalty for a number of possible distributions for 
the side information. Note that for all of these side information distributions (except the uniform and 
exponential distributions), the rate penalty can be made arbitrarily large by choosing the appropriate shape 
parameter to place more probability near q = 0 or $q = \infty$. In the former case (Lognormal, Gamma, 
or Pathological q), the large rate-loss occurs because when $q \approx 0$, the informed encoder can 
transmit almost zero rate while the uninformed encoder must transmit a large rate to achieve high 
resolution. In the latter case (Pareto or Cauchy q), the large rate-loss is caused by the heavy tails of 
the distribution for q. Specifically, even though q is big only very rarely, it is the rare samples of 
large q that dominate the moments. Thus an informed encoder can describe the source extremely accurately 
during the rare occasions when q is large, while an uninformed encoder must always spend a large rate to 
obtain a low average distortion. 

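TABLE I 

Asymptotic rate-penalty in nats. Euler's constant is denoted by $\gamma$. 

The rate penalties below are for not knowing distortion side information q at the encoder when distortion 
is measured via $d(x, \hat{x}; q) = q\,(x - \hat{x})^2$. (Multiply penalties in nats by 
$1/\ln 2 \approx 1.44$ to convert to bits.) 

Distribution Name    Density for q                                                        Rate Gap in nats 
Exponential          $\tau \exp(-\tau q)$                                                 $\frac{1}{2}\gamma \approx 0.2886$ 
Uniform              $1,\ q \in [0, 1]$                                                   $\frac{1}{2}(1 - \ln 2) \approx 0.1534$ 
Lognormal            $\frac{1}{q\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{(\ln q - \mu)^2}{2\sigma^2}\right]$    $\frac{\sigma^2}{4}$ 
Pareto               $\frac{a\, b^a}{q^{a+1}},\ q \ge b > 0,\ a > 1$                      $\frac{1}{2}\left[\ln\frac{a}{a-1} - \frac{1}{a}\right]$ 
Gamma                $\frac{b\,(bq)^{a-1} \exp(-bq)}{\Gamma(a)}$                          $\frac{1}{2}\left\{\ln a - \frac{d}{dx}[\ln \Gamma(x)]_{x=a}\right\} \approx \frac{1}{4a}$ 
Pathological         $(1-\epsilon)\,\delta(q-\epsilon) + \epsilon\,\delta(q - 1/\epsilon)$    $\frac{1}{2}\ln(1 + \epsilon - \epsilon^2) - \frac{1-2\epsilon}{2}\ln\epsilon \approx \frac{1}{2}\ln\frac{1}{\epsilon}$ 
Positive Cauchy      $\frac{2/\pi}{1 + q^2},\ q \ge 0$                                    $\infty$ 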

Finally, note that all but one of these distributions would require infinite rate to losslessly communicate 
the side information. Thus the gains in using distortion side information cannot be obtained by exactly 
describing the side information to the receiver. 
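
Individual table entries can be spot-checked from the moments of q. The sketch below assumes the rate gap for scaled quadratic distortion is $\frac{1}{2}(\ln E[q] - E[\ln q])$ and checks two cases: q uniform on (0, 1), using a midpoint-rule integral, and the two-point "pathological" distribution, using exact moments. The helper names and the number of integration points are illustrative choices.

```python
import math

def gap_from_moments(mean_q, mean_log_q, k=1, r=2):
    """Rate gap (k/r) * (ln E[q] - E[ln q]) in nats."""
    return (k / r) * (math.log(mean_q) - mean_log_q)

def uniform_gap_numeric(n=200000):
    """Midpoint-rule estimate of the gap for q uniform on (0, 1)."""
    h = 1.0 / n
    mean_q = sum((i + 0.5) * h for i in range(n)) * h
    mean_log_q = sum(math.log((i + 0.5) * h) for i in range(n)) * h
    return gap_from_moments(mean_q, mean_log_q)

def pathological_gap(eps):
    """Exact gap for q = eps with probability 1 - eps and q = 1/eps with probability eps."""
    mean_q = (1 - eps) * eps + eps * (1 / eps)
    mean_log_q = (1 - eps) * math.log(eps) + eps * math.log(1 / eps)
    return gap_from_moments(mean_q, mean_log_q)
```

The uniform estimate approaches $\frac{1}{2}(1 - \ln 2)$, and `pathological_gap(eps)` grows without bound as `eps` shrinks, in line with the discussion of shape parameters above.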

VI. Non-Asymptotic Bounds for Quadratic Distortions 

So far we have shown that in the high-resolution limit, knowing q at the encoder is sufficient. Our 
main analytical tool was the additive test-channel distribution $\hat{x} = x + z_q$ where the additive 
noise in the test channel depends on the distortion side information. Evidently, additive noise 
test-channels of this type are asymptotically optimal. To investigate the rate-loss at finite resolutions 
we develop two results for scaled quadratic distortion measures. We also briefly mention how these results 
can be generalized to other distortion measures. 

A. A Medium Resolution Bound Using Fisher Information 

Theorem 9 Consider a scaled quadratic distortion measure of the form 
$d(x, \hat{x}; q) = q \cdot (x - \hat{x})^2$ with $q \ge q_{\min} > 0$. Then the maximum rate-gap at 
distortion D is bounded by 

$$R_{\text{[Q-ENC-W-DEC]}}(D) - R_{\text{[Q-BOTH-W-BOTH]}}(D) \le \frac{J(x|w)}{2} \cdot \min\!\left[1, \frac{D}{q_{\min}}\right] \tag{36}$$

where $J(x|w)$ is the Fisher Information in estimating a non-random parameter $\tau$ from $\tau + x$ 
conditioned on knowing w. Specifically, 

$$J(x|w) = \int p_w(w) \left\{ \int p_{x|w}(x|w) \left[ \frac{\partial}{\partial x} \log p_{x|w}(x|w) \right]^2 dx \right\} dw. \tag{37}$$

Similar bounds can be developed with other distortion measures provided that $D/q_{\min}$ is replaced with 
a quantity proportional to the variance of the quantization error. See the remark after the proof of 
Theorem 9 in the appendix for details. Also, Zamir and Feder discuss related bounds in [24, Appendix D]. 

One may wonder why Fisher Information appears in the above rate-loss bound. After all, Fisher 
Information is most commonly used to lower bound the error in estimating a parameter via its use in the 
Cramér-Rao bound. What does Fisher Information have to do with source coding? 

To answer this question, recall that our bounds are all developed by using an additive test-channel 
distribution of the form $\hat{x} = x + z$. Thus, a clever source decoder could treat each source sample 
$x[i]$ as a parameter to be "estimated" from the quantized representation $\hat{x}[i]$. If an efficient 
estimator exists, this procedure could potentially reduce the distortion by the reciprocal of the Fisher 
Information. But if the distortion can be reduced in this manner without affecting the rate, then the 
additive test-channel distribution must be sub-optimal and a rate gap must exist. 

So the bound in Theorem 9 essentially measures the rate gap by measuring how much our additive 
test-channel distribution could be improved if an efficient estimator existed for x given $\hat{x}$. This 
bound will tend to be good when an efficient estimator does exist and poor otherwise. For example, if x is 
Gaussian with unit-variance conditioned on w, then the Fisher Information term in (36) evaluates to one 
and the worst-case rate-loss is at most half a bit at maximum distortion. This corresponds to the half-bit 
bound on the rate-loss for the pure Wyner-Ziv problem derived in [8]. But if x is discontinuous (e.g., if 
x is uniform), then no efficient estimator exists and the bound in (36) is poor. 

As an aside, we note that the proof of Theorem 9 does not require any extra regularity conditions. 
Hence, if the Fisher Information of the source is finite, it can be immediately applied without the need 
to check whether the source is admissible according to Definition 2. 
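
To make the Gaussian case concrete, the sketch below evaluates the Fisher Information of a zero-mean Gaussian density by numerical integration (it should equal the reciprocal of the variance) and then evaluates a bound of the form $\frac{1}{2} J \min[1, D/q_{\min}]$, which is the form consistent with the value $(1/2) \cdot \min[1, D]$ used in the Gaussian example of Section VI-C. The function names, the integration span, and the grid size are assumptions made for illustration.

```python
import math

def gaussian_fisher_information(sigma2, n=200000, span=12.0):
    """Midpoint-rule estimate of the integral of p(x) * (d/dx log p(x))^2 for N(0, sigma2)."""
    sigma = math.sqrt(sigma2)
    lo, hi = -span * sigma, span * sigma
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p = math.exp(-x * x / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
        score = -x / sigma2               # derivative of log p(x) with respect to x
        total += p * score * score * h
    return total

def fisher_rate_gap_bound(J, D, q_min):
    """Bound of the form (J / 2) * min(1, D / q_min)."""
    return 0.5 * J * min(1.0, D / q_min)
```

With a unit-variance Gaussian and `q_min = 1`, `fisher_rate_gap_bound(1.0, D, 1.0)` reduces to `0.5 * min(1, D)`, so the bound tightens linearly as the distortion decreases.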

B. A Low Resolution Bound 

While the Fisher Information bound from (36) can be used at low resolutions, it can be quite poor if 
the source is not smooth. Therefore, we derive the following bound on the rate-loss, which is independent 
of the distortion level and hence most useful at low resolution. 

Theorem 10 Consider a scaled quadratic distortion measure of the form 
$d(x, \hat{x}; q) = q \cdot (x - \hat{x})^2$ and denote the minimum/maximum conditional variance of the 
source by 

$$\sigma^2_{\min} = \min_{w} \mathrm{Var}[x \mid w = w] \tag{38a}$$

$$\sigma^2_{\max} = \max_{w} \mathrm{Var}[x \mid w = w]. \tag{38b}$$

Then the rate gap is at most the conditional relative entropy of the source from a Gaussian 
distribution plus a term depending on the range of the conditional variance: 

$$R_{\text{[Q-ENC-W-DEC]}}(D) - R_{\text{[Q-BOTH-W-BOTH]}}(D) \le D\big(p_{x|w} \,\big\|\, \mathcal{N}(\mathrm{Var}[x])\big) + \frac{1}{2} \log\!\left(1 + \frac{\sigma^2_{\max}}{\sigma^2_{\min}}\right) \tag{39}$$

where $\mathcal{N}(t)$ represents a Gaussian random variable with mean zero and variance t. 

Similar bounds can be developed for other distortion measures as discussed after the proof of Theorem 10 
in the appendix. 

Consider the familiar Wyner-Ziv scenario where the signal side information is a noisy observation of 
the source. Specifically, let w = x + v where v is independent of x. In this case, the conditional variance 
is constant and (39) becomes 

$$D\big(p_{x|w} \,\big\|\, \mathcal{N}(\mathrm{Var}[x])\big) + \frac{1}{2} \log 2 \tag{40}$$

and the rate-loss is at most half a bit plus the deviation from Gaussianity of the source. 

If x is Gaussian when conditioned on w = w, then the rate-loss is again seen to be at most half a bit 
as in [8]. In contrast to [8], which is independent of the source, however, both our bounds in (36) and 
(39) depend on the source distribution. Hence, we conjecture that our bounds are loose. In particular, 
for a discrete source, the worst case rate loss is at most $H(x|w)$, but this is not captured by our 
results since both bounds become infinity. Techniques from [25], [26], [8] may yield tighter bounds. 

C. A Finite Rate Gaussian Example 

To gain some idea of when the asymptotic results take effect, we consider a finite rate Gaussian 
example. Specifically, let the source consist of a sequence of Gaussian random variables with mean zero 
and variance 1 and consider distortion side information with Pr[q = 1] = 0.6, Pr[q = 10] = 0.4, and 
distortion measure $d(x, \hat{x}; q) = q \cdot (x - \hat{x})^2$. 

The case without side information is equivalent to quantizing a Gaussian random variable with distortion 
measure $4.6\,(x - \hat{x})^2$ and thus the rate-distortion function is 

$$R_{\text{[Q-NONE-W-NONE]}}(D) = \begin{cases} 0, & D \ge 4.6 \\ \frac{1}{2} \ln \frac{4.6}{D}, & D < 4.6. \end{cases} \tag{41}$$

To determine $R_{\text{[Q-BOTH-W-NONE]}}(D)$ we must set up a constrained optimization as we did for the 
binary-Hamming scenario in Appendix B. This optimization results in a "water-pouring" bit allocation, which 
uses more bits to quantize the source when q = 10 than when q = 1. Specifically, the optimal test-channel 
is a Gaussian distribution where both the mean and the variance depend on q and thus $\hat{x}$ has a 
Gaussian mixture distribution. Going through the details of the constrained optimization yields 

$$R_{\text{[Q-BOTH-W-NONE]}}(D) = \begin{cases} 0, & 4.6 \le D \\ \frac{0.4}{2} \ln \frac{4}{D - 0.6}, & D^* \le D \le 4.6 \\ \frac{0.4}{2} \ln \frac{10}{D} + \frac{0.6}{2} \ln \frac{1}{D}, & D \le D^* \end{cases} \tag{42}$$

for some appropriate threshold $D^*$. Evaluating (34) for this case indicates that the rate-gap between 
(41) and (42) goes to $0.5 \cdot (\ln 4.6 - 0.4 \ln 10) \approx 0.3$ nats $\approx 0.43$ bits. 

Computing $R_{\text{[Q-ENC-W-NONE]}}(D)$ analytically seems difficult. Thus, when distortion side 
information is only available at the encoder we obtain a numerical upper bound on the rate by using the 
same codebook distribution as when q is known at both encoder and decoder. This yields a rate penalty of 
$I(\hat{x}; q)$.^12 We can obtain a simple analytic bound from Theorem 9. Specifically, evaluating (36) 
yields that the rate penalty is at most $(1/2) \cdot \min[1, D]$. 
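
The two closed-form rate-distortion functions of this example can be evaluated directly. The sketch below is a minimal illustration in nats; it assumes the piecewise forms $\frac{1}{2}\ln(4.6/D)$ without side information and the water-pouring allocation with side information at both terminals, and it takes the threshold between the two water-pouring regimes to be $D^* = 1$. That threshold is our own evaluation of where the $q = 1$ samples start receiving bits (it makes the two branches meet continuously) and is not a value stated in the text.

```python
import math

P1, P10 = 0.6, 0.4                 # Pr[q = 1], Pr[q = 10]
MEAN_Q = P1 * 1.0 + P10 * 10.0     # E[q] = 4.6
D_STAR = 1.0                       # assumed threshold where q = 1 samples start getting bits

def rate_no_side_info(D):
    """Unit-variance Gaussian source with distortion 4.6 * (x - xhat)^2, in nats."""
    return 0.0 if D >= MEAN_Q else 0.5 * math.log(MEAN_Q / D)

def rate_full_side_info(D):
    """Water-pouring rate when q is known at both encoder and decoder, in nats."""
    if D >= MEAN_Q:
        return 0.0
    if D >= D_STAR:
        # only the q = 10 samples are quantized; q = 1 samples contribute distortion 0.6
        return 0.5 * P10 * math.log(4.0 / (D - P1))
    # both sample types are quantized with equal weighted distortion D
    return 0.5 * P10 * math.log(10.0 / D) + 0.5 * P1 * math.log(1.0 / D)

def rate_gap(D):
    """Gap between the uninformed and fully informed systems at distortion D."""
    return rate_no_side_info(D) - rate_full_side_info(D)
```

For any D below the assumed threshold, `rate_gap(D)` equals $0.5 \cdot (\ln 4.6 - 0.4 \ln 10)$ exactly, which matches the asymptotic value of about 0.3 nats quoted above.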

In Fig. 10 we evaluate these rate-distortion trade-offs. We see that at zero rate, the rate-distortion 
functions for the case of no side information, encoder-only side information, and full side information 
have the same distortion since no bits are available for quantization. Furthermore, we see that the Fisher 
Information bound is loose at zero rate. As the rate increases, the system with full distortion side 
information does best because it uses the few available bits to represent only the important source 
samples with q = 10. The decoder reconstructs these source samples from the compressed data and 
reconstructs the less important samples to zero (the mean of x). In this regime, the system with distortion 
side information at the encoder also more accurately quantizes the important source samples. But since 
the decoder does not know q, it does not know which samples of $\hat{x}$ to reconstruct to zero. Thus the 
system with q available at the encoder performs worse than the one with q at both encoder and decoder 
but better than the system without side information. As the rate increases further, both systems with 
distortion side information quantize source samples with both q = 1 and q = 10. Thus the codebook 
distribution for $\hat{x}$ goes from a Gaussian mixture to become more and more Gaussian and the rate-loss 
for the system with only encoder side information goes to zero. Finally, we note that even at the modest 
distortion of -5 dB, the asymptotic effects promised by our theorems have already taken effect. 

^12 Actually, since the rate distortion function is convex, we take the lower convex envelope of the curve 
resulting from the optimal test-channel distribution. 

Fig. 10. Rate-distortion curves for quantizing a Gaussian source x with distortion $q\,(x - \hat{x})^2$ 
where the side information q is 1 with probability 0.6 or 10 with probability 0.4. From bottom to top on 
the right the curves correspond to the rate required when both encoder and decoder know q, a numerically 
computed upper bound to the rate when only the encoder knows q, the rate when neither encoder nor decoder 
know q, and the Fisher Information upper bound from Theorem 9 for when only the encoder knows q. 

VII. Discussion 

In this paper, we introduced the notion of distortion side information, which does not directly depend 
on the source but instead affects the distortion measure. Furthermore, we showed that if general side 
information can be decomposed into such a distortion dependent component and a signal dependent 
component then under certain conditions the former is required only at the encoder and the latter is 
required only at the decoder. In this section, we discuss some implications and applications of these 
ideas. 

A. Applications to Sensors and Sensor Networks 

There is a growing interest in sensor networks where multiple nodes with potentially correlated 
observations cooperate to sense the environment. A variety of researchers have already demonstrated 
the advantages of distributed source coding in efficiently using signal side information in such networks 
[27], [28], [29], [30], [31]. We believe, however, that the possible applications of distributed coding with 
distortion side information are equally compelling since many sensors naturally receive distortion side
information in addition to the observed signal. 

For example, sensors which perform simple filtering or averaging of the incoming signal can use the 
variance as an indicator of reliability. Similarly, sensors can observe the background level of light, sound, 
etc., to obtain an estimate of the noise-floor in the absence of a signal of interest. Also, many systems 
are designed to record data at a roughly constant average amplitude. This is often accomplished by using 
automatic gain control (AGC) to compensate for attenuation due to distance, weather, or obstacles. Since 
thermal noise in the sensor front-end is usually independent of such attenuations, the AGC level can be 
used as a simple indicator of the signal-to-noise ratio of an observed signal. 

B. Richer Distortion Models and Perceptual Coding 

The notion of distortion side information may be particularly useful in the design of perceptual coders. 
Specifically, it is well-known that mean-square error is at best a poor representative of how humans 
experience distortion. Hence perceptual coding systems use features of the human visual system (HVS) or 
human auditory system (HAS) to achieve low subjective distortion even when the mean-square distortion 
is quite large. Unfortunately, creating such a perceptual coder often requires the designer to be an expert
in both human physiology and quantizer design. This difficulty may be one of the reasons why
information theory has sometimes had less impact on compression standards than on communication
standards.

Using the abstraction of distortion side information to represent such perceptual effects, however, may
help overcome this barrier. For example, physiological experts could focus on producing a distortion
model that incorporates perceptual effects by determining a value $q_i$ that scales the quadratic distortion
between the $i$th source sample and its reconstruction. Quantization experts could focus on designing
compression systems operating with distortion side information. Then various perceptual models could 
be quickly and easily combined with quantizer designs to find the best combination. This modular design 
would allow a good quantization system to be used in a variety of application domains simply by changing 
the model for the distortion side information. Similarly, it would allow a finely tuned perceptual model 
to be used in many types of quantizer designs. 

The theoretical justification for using distortion side information to design modular quantizers is 
Theorem 3. This theorem can be interpreted as saying that even if the perceptual model that produces the
distortion weighting is a complicated function of the source, the decoder does not need to know how the 
perceptual model works. Instead, it is sufficient for the decoder simply to know the statistics of q provided 
that q is available to the encoder. Corollary ^ strengthens this conclusion since even when the source and 
distortion side information are statistically dependent, the gap between the modular, encoder-only side 
information architecture and a system with full side information is the mutual information between the 
source and the side information. No system (including non-modular systems where the decoder attempts 
to account for dependence between the source and distortion measure) could expect to do better since 
this rate gap essentially corresponds to how many bits about the source the side information implicitly 
conveys. Since no explicit side information is available at the decoder, no scheme can recover this rate
gap.

Finally, it may be somewhat premature to advocate a particular design for perceptual coders based
primarily on rate-distortion results. At a minimum, however, we point out that if a perceptual distortion
side information model can be constructed it can at least be used to find a bound on the minimum possible
bit rate at a given distortion. Having such a performance benchmark to strive for can be a powerful force
in inspiring system designers to search for new innovations.

C. Decomposing Side Information Into q and w 

Our problem model of Section |ffl] requires that the side information can be decomposed into distortion 
side information and signal side information. In many scenarios, this may be a natural view of the 
compression task. In problems where this is not the case, however, we briefly discuss how general side 
information may be decomposed into the pair (q,w) required by our theorems. 

First, notice that the formulation in ^ is completely general in the sense that any side information
$z$ taking values in the set $\mathcal{Z}$ can be trivially decomposed into $(q, w)$ by setting $(q, w) = (z, z)$ with
$(\mathcal{Q}, \mathcal{W}) = (\mathcal{Z}, \mathcal{Z})$. Of course, this makes most of our results uninteresting. One systematic procedure for
potentially improving this decomposition is as follows. First, replace each pair of values $(q_0, q_1)$ where
$d(\cdot, \cdot; q_0) = d(\cdot, \cdot; q_1)$ with a new value $q'$ and adjust $\mathcal{Q}$ accordingly. Next, replace each pair of values
$(w_0, w_1)$ where $p_{x|w}(\cdot|w_0) = p_{x|w}(\cdot|w_1)$ and $p_{q|w}(\cdot|w_0) = p_{q|w}(\cdot|w_1)$ with a new value $w'$ and adjust $\mathcal{W}$
accordingly. If the resulting $\mathcal{Q}$ and $\mathcal{W}$ are smaller than $\mathcal{Z}$, then our results become non-trivial.
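
The merging procedure just described can be sketched in a few lines of Python. This is only an illustrative toy, not part of the original development: the alphabet `Z`, the distortion `d`, and the conditional `p_x_given_w` are hypothetical choices made so that some values of the side information are redundant. Starting from the trivial decomposition (q, w) = (z, z), values of q with identical distortion functions are merged first; values of w are then merged when both the conditional distribution of x and the conditional distribution of the merged q agree.

```python
from itertools import product

def merge_equivalent(values, signature):
    """Map each value to the first value seen that has the same signature."""
    first_seen = {}
    return {v: first_seen.setdefault(signature(v), v) for v in values}

# Hypothetical toy alphabets (assumptions for illustration only).
X = (0, 1)
Z = (0, 1, 2, 3, 4, 5)

def d(x, xhat, q):
    # Assumed distortion: Hamming distortion whose weight depends only on the parity of q.
    return (1 + q % 2) * (x != xhat)

def p_x_given_w(w):
    # Assumed source statistics: depend on w only through (w % 4) >= 2.
    return (0.9, 0.1) if (w % 4) >= 2 else (0.5, 0.5)

# Step 1: merge q0 and q1 whenever d(., .; q0) = d(., .; q1).
q_map = merge_equivalent(Z, lambda q: tuple(d(x, xh, q) for x, xh in product(X, X)))

# Step 2: merge w0 and w1 whenever p_{x|w} and p_{q|w} both agree.
# Here q is a function of z, so p_{q|w}(.|w) is a point mass at q_map[w].
w_map = merge_equivalent(Z, lambda w: (p_x_given_w(w), q_map[w]))

Q = sorted(set(q_map.values()))
W = sorted(set(w_map.values()))
```

In this toy case the six values of z collapse to two values of q and four values of w, so both alphabets are strictly smaller than the original one.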

As with the problem of determining a minimal sufficient statistic, there are many potential decompositions,
and different notions of good decompositions may be appropriate in different applications. For
example, minimizing the cardinality of $\mathcal{Q}$, $\mathcal{W}$, or both, might be useful in simplifying quantizer design
or related tasks. Alternatively, minimizing $I(q; w)$, $H(q)$, $H(w)$, or similar information measures may
be useful if $q$/$w$ must be communicated to the encoder/decoder by a third party.

VIII. Concluding Remarks 

Our analysis indicates that side information that affects the distortion measure can provide significant 
benefits in source coding. Perhaps our most surprising result is that in a number of cases (e.g., sources
uniformly distributed over a group, or in the high-resolution limit) side information at the encoder is just
as good as side information known at both encoder and decoder. Furthermore, this "separation theorem"
can be composed with the previously known result that having signal side information at the decoder
is often as good as having it at both encoder and decoder (e.g., in the high-resolution limit). Our main
results regarding when knowing a given type of side information at one place is as good as knowing 
it at another place are summarized in Fig. |^ Also, we computed the rate-loss for lacking a particular 
type of side information in a specific place. These penalty theorems show that lacking the proper side 
information can produce arbitrarily large degradations in performance. Taken together, we believe these 
results suggest that distortion side information is a useful source coding paradigm. 

Appendix 

A. Group Difference Distortion Measures Proof 

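Proof of Theorem^] Assume that $p^*_{\hat{x}|x,q}(\hat{x}|x, q)$ is an optimal test-channel distribution with the
conditional $p^*_{\hat{x}|q}(\hat{x}|q)$. By symmetry, for any $t \in \mathcal{X}$, the shifted distribution

$$p^t_{\hat{x}|x,q}(\hat{x}|x, q) = p^*_{\hat{x}|x,q}(\hat{x} \oplus t \,|\, x \oplus t, q) \tag{43}$$

must also be an optimal test-channel. Since mutual information is convex in the test-channel distribution,
we obtain an optimal test-channel distribution $p^{**}$ by averaging $t$ over $\mathcal{X}$ via the uniform measure $d\mathcal{X}(t)$:

$$p^{**}_{\hat{x}|x,q}(\hat{x}|x, q) = \int_{\mathcal{X}} p^t_{\hat{x}|x,q}(\hat{x}|x, q)\, d\mathcal{X}(t). \tag{44}$$

To prove that the resulting distribution for $\hat{x}$ given $q$ is uniform for all $q$ (and hence independent of $q$),
we will show that $p^{**}_{\hat{x}|q}(\hat{x}|q) = p^{**}_{\hat{x}|q}(\hat{x} \oplus r|q)$ for any $r \in \mathcal{X}$:

\begin{align}
p^{**}_{\hat{x}|q}(\hat{x}|q) &= \int_{\mathcal{X}} p^{**}_{\hat{x}|x,q}(\hat{x}|x, q)\, d\mathcal{X}(x) \tag{45} \\
&= \int_{\mathcal{X}} \int_{\mathcal{X}} p^{t}_{\hat{x}|x,q}(\hat{x}|x, q)\, d\mathcal{X}(t)\, d\mathcal{X}(x) \tag{46} \\
&= \int_{\mathcal{X}} \int_{\mathcal{X}} p^{*}_{\hat{x}|x,q}(\hat{x} \oplus t \,|\, x \oplus t, q)\, d\mathcal{X}(t)\, d\mathcal{X}(x) \tag{47} \\
&= \int_{\mathcal{X}} \int_{\mathcal{X}} p^{*}_{\hat{x}|x,q}(\hat{x} \oplus r \oplus t \,|\, x \oplus r \oplus t, q)\, d\mathcal{X}(r \oplus t)\, d\mathcal{X}(x) \tag{48} \\
&= \int_{\mathcal{X}} \int_{\mathcal{X}} p^{*}_{\hat{x}|x,q}(\hat{x} \oplus r \oplus t \,|\, x \oplus r \oplus t, q)\, d\mathcal{X}(t)\, d\mathcal{X}(x \oplus r) \tag{49} \\
&= \int_{\mathcal{X}} \int_{\mathcal{X}} p^{*}_{\hat{x}|x,q}(\hat{x} \oplus r \oplus t \,|\, x \oplus t, q)\, d\mathcal{X}(t)\, d\mathcal{X}(x) \tag{50} \\
&= p^{**}_{\hat{x}|q}(\hat{x} \oplus r|q). \tag{51}
\end{align}

Equation (45) follows from Bayes' law and the fact that $d\mathcal{X}$ is the uniform measure on $\mathcal{X}$. The next
two lines follow from the definition of $p^{**}$ and $p^t$ respectively. To obtain (48), we make the change of
variable $t \to r \oplus t$, and then apply the fact that the uniform measure is shift invariant to obtain (49).
Similarly, we make the change of variable $x \oplus r \to x$ to obtain (50). The last line follows from the
definition in (44).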

Note that this argument applies regardless of whether the side information is available at the encoder, 
decoder, both, or neither. ■ 
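
The averaging argument can be checked numerically on a small finite group. The following is a minimal sketch under stated assumptions, not part of the original proof: the group is taken to be the cyclic group of order 4 under addition modulo `n`, the side information q is held fixed and therefore omitted, and the starting test channel `p_star` and difference distortion `d` are arbitrary hypothetical choices. Exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction

n = 4  # hypothetical group: integers modulo 4 under addition

# An arbitrary asymmetric test channel, p_star[x][xhat] = p*(xhat | x); each row sums to 1.
p_star = [
    [Fraction(7, 10), Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)],
    [Fraction(0), Fraction(1), Fraction(0), Fraction(0)],
    [Fraction(1, 4), Fraction(1, 4), Fraction(1, 4), Fraction(1, 4)],
    [Fraction(1, 2), Fraction(0), Fraction(0), Fraction(1, 2)],
]

def shifted(p, t):
    """Shifted channel p^t(xhat | x) = p(xhat + t | x + t), as in (43)."""
    return [[p[(x + t) % n][(xh + t) % n] for xh in range(n)] for x in range(n)]

def averaged(p):
    """Average of the shifted channels over a uniformly chosen shift t, as in (44)."""
    shifts = [shifted(p, t) for t in range(n)]
    return [[sum(s[x][xh] for s in shifts) / n for xh in range(n)] for x in range(n)]

def output_marginal(p):
    """Distribution of xhat when the source x is uniform on the group."""
    return [sum(p[x][xh] for x in range(n)) / n for xh in range(n)]

def expected_distortion(p, d):
    """E[d(xhat - x)] for a difference distortion d and a uniform source."""
    return sum(p[x][xh] * d[(xh - x) % n] for x in range(n) for xh in range(n)) / n

d = [0, 1, 2, 1]           # hypothetical difference distortion on the group
p_avg = averaged(p_star)   # the symmetrized channel
```

For this example the symmetrized channel `p_avg` produces a uniform reconstruction distribution while leaving the expected distortion unchanged, which is the property the proof relies on.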



B. Binary-Hamming Rate-Distortion Derivations

In this section we derive the rate-distortion functions for a binary source with Hamming distortion. 

1) With Encoder Side Information: The rate-distortion function for source independent side information
available only at the encoder is the same as with the side information available at both encoder and
decoder. Hence, we compute $R_{[\text{Q-ENC}]}(D)$ and $R_{[\text{Q-BOTH}]}(D)$ by considering the latter case and noting
that optimal encoding corresponds to simultaneous description of independent random variables [32,
Section 13.3.3]. Specifically, the source samples for each value of $q$ can be quantized separately using
the distribution

$$p_{\hat{x}|x,q}(\hat{x}|x, q) = \begin{cases} 1 - p_q, & \hat{x} = x \\ p_q, & \hat{x} = 1 - x. \end{cases} \tag{52}$$

The cross-over probabilities $p_q$ correspond to the bit allocations for each value of the side information
and are obtained by solving a constrained optimization problem:

$$R_{[\text{Q-BOTH}]}(D) = \min_{E[d(x, \hat{x}; q)] = D} \sum_{i=1}^{N} p_q(i) \cdot [1 - H_b(p_i)] \tag{53}$$

where $H_b(\cdot)$ is the binary entropy function, $p_q(i)$ is the probability of the $i$th of the $N$ values of the
side information, and $p_i$ is the corresponding cross-over probability.

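Using Lagrange multipliers, we construct the functional

$$J(D) = \sum_{i=1}^{N} p_q(i) \cdot [1 + p_i \log p_i + (1 - p_i) \log(1 - p_i)] + \lambda \sum_{i=1}^{N} p_q(i) \cdot [\alpha_i + p_i \beta_i].$$

Differentiating with respect to $p_i$ and setting equal to 0 yields

\begin{align}
\frac{\partial J}{\partial p_i} = p_q(i) \log \frac{p_i}{1 - p_i} + \lambda p_q(i) \beta_i &= 0 \tag{54} \\
\log \frac{p_i}{1 - p_i} &= -\lambda \beta_i \tag{55} \\
p_i &= \frac{2^{-\lambda \beta_i}}{1 + 2^{-\lambda \beta_i}}. \tag{56}
\end{align}

Thus we obtain the rate-distortion functions

$$R_{[\text{Q-ENC}]}(D) = R_{[\text{Q-BOTH}]}(D) = 1 - \sum_{i=1}^{N} p_q(i) \cdot H_b\left(\frac{2^{-\lambda \beta_i}}{1 + 2^{-\lambda \beta_i}}\right) \tag{57a}$$

where $\lambda$ is chosen to satisfy

$$\sum_{i=1}^{N} p_q(i) \cdot \left[\alpha_i + \beta_i \frac{2^{-\lambda \beta_i}}{1 + 2^{-\lambda \beta_i}}\right] = D. \tag{57b}$$
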

2) Without Encoder Side Information: When no encoder side information is available, decoder side
information is useless. Hence, the problem is equivalent to quantizing a symmetric binary source with
distortion measure

$$d(x, \hat{x}) = E[\alpha_q + \beta_q \cdot d_H(x, \hat{x})] = E[\alpha_q] + E[\beta_q] \cdot d_H(x, \hat{x}). \tag{58}$$

The rate-distortion function is obtained by scaling and translating the rate-distortion function for the
classical binary-Hamming case:

$$R_{[\text{Q-NONE}]}(D) = 1 - H_b\left(\frac{D - E[\alpha_q]}{E[\beta_q]}\right). \tag{59}$$
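
The two binary-Hamming rate-distortion functions derived in this appendix can be evaluated numerically. The sketch below is illustrative only: the function names are our own, the multiplier $\lambda$ in (57b) is found by a simple bisection (an implementation choice, not something prescribed by the derivation), and the weights are assumed to satisfy $\beta_i > 0$. Rates are in bits per sample.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def crossover(lam, beta):
    """Cross-over probability 2^(-lam*beta) / (1 + 2^(-lam*beta)), as in (56)."""
    t = 2.0 ** (-lam * beta)
    return t / (1.0 + t)

def rate_q_both(D, p_q, alpha, beta):
    """Rate with q at the encoder (or at both), following (57a) and (57b)."""
    def distortion(lam):
        return sum(p * (a + b * crossover(lam, b)) for p, a, b in zip(p_q, alpha, beta))
    d_min = sum(p * a for p, a in zip(p_q, alpha))
    d_max = distortion(0.0)
    if not d_min < D <= d_max:
        raise ValueError("D must lie in (E[alpha], E[alpha] + E[beta]/2]")
    lo, hi = 0.0, 1.0
    while distortion(hi) > D:       # distortion decreases as lam grows
        hi *= 2.0
    for _ in range(200):            # bisection on lam
        mid = 0.5 * (lo + hi)
        if distortion(mid) > D:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 1.0 - sum(p * binary_entropy(crossover(lam, b)) for p, b in zip(p_q, beta))

def rate_q_none(D, p_q, alpha, beta):
    """Rate without encoder side information, following (59)."""
    mean_alpha = sum(p * a for p, a in zip(p_q, alpha))
    mean_beta = sum(p * b for p, b in zip(p_q, beta))
    return 1.0 - binary_entropy((D - mean_alpha) / mean_beta)

# Hypothetical example: two side-information values with unequal weights.
p_q, alpha, beta = [0.6, 0.4], [0.0, 0.0], [1.0, 10.0]
r_both = rate_q_both(0.5, p_q, alpha, beta)
r_none = rate_q_none(0.5, p_q, alpha, beta)
```

When all the $\beta_i$ are equal the two functions coincide, and otherwise the rate with encoder side information is the smaller of the two, as one would expect.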

C. High-Resolution Proofs

Proof of Theorem 3: To obtain $R_{[\text{Q-ENC-W-DEC}]}(D)$ we apply the Wyner-Ziv rate-distortion formula¹³
to the "super-source" $x' = (x, q)$ yielding

$$R_{[\text{Q-ENC-W-DEC}]}(D) = \inf_{p_{\hat{x}|x,q}(\hat{x}|x,q)} I(x, q; \hat{x}|w) \tag{60}$$

where the optimization is subject to the constraint that $E[d(x, v(\hat{x}, w); q)] \le D$ for some reconstruction
function $v(\cdot, \cdot)$. To obtain $R_{[\text{Q-BOTH-W-BOTH}]}(D)$ we specialize the well-known conditional rate-distortion
function to our notation yielding

$$R_{[\text{Q-BOTH-W-BOTH}]}(D) = \inf_{p_{\hat{x}|x,q,w}(\hat{x}|x,q,w)} I(x; \hat{x}|w, q) \tag{61}$$

where the optimization is subject to the constraint that $E[d(x, \hat{x}; q)] \le D$.

Let us define $\hat{x}^*$ as the distribution that optimizes (60). Similarly, define $\hat{x}^*_w$ as the distribution that
optimizes (61). Finally, define $z$ given $q = q$ to be a random variable with a conditional distribution that
maximizes $h(z|q = q)$ subject to the constraint that

$$E[d(x, x + z; q)|q = q] \le E[d(x, \hat{x}^*_w; q)|q = q]. \tag{62}$$

Then we have the following chain of inequalities:

\begin{align}
\Delta R(D) &= R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) \tag{63} \\
&= I(\hat{x}^*; q, x|w) - [h(x|q, w) - h(x|q, w, \hat{x}^*_w)] \tag{64} \\
&= I(\hat{x}^*; q, x|w) - h(x|q, w) + h(x - \hat{x}^*_w|q, w, \hat{x}^*_w) \tag{65} \\
&\le I(\hat{x}^*; q, x|w) - h(x|q, w) + h(x - \hat{x}^*_w|q) \tag{66} \\
&\le I(\hat{x}^*; q, x|w) - h(x|q, w) + h(z|q) \tag{67} \\
&\le I(x + z; q, x|w) - h(x|q, w) + h(z|q) \tag{68} \\
&= h(x + z|w) - h(x + z|w, q, x) - h(x|q, w) + h(z|q) \tag{69} \\
&= h(x + z|w) - h(x|q, w) \tag{70} \\
&= h(x + z|w) - h(x|w) \tag{71}
\end{align}

$$\lim_{D \to D_{\min}} \Delta R(D) = 0. \tag{72}$$

¹³ Some readers may be more familiar with the Wyner-Ziv formula as a difference of mutual informations (e.g., as in [6]), but
the form in (60) is equally valid [4] and is sometimes more convenient.

Equation (67) follows from the definition of $z$ to be entropy maximizing subject to a distortion constraint.
Since $z$ is independent of $x$ and $w$, the choice $\hat{x} = x + z$ with $v(\hat{x}, w) = \hat{x}$ is an upper bound to (60)
and yields (68). We obtain (71) by recalling that according to our problem model in @, $q$ and $x$ are
independent given $w$. Finally, we obtain (72) by using the "continuity of entropy" result from [23,
Theorem 1].

Note that although the $z$ in [23, Theorem 1] is an entropy maximizing distribution while our $z$ is
a mixture of entropy maximizing distributions, the special form of the density is not required for the
continuity of entropy result in [23, Theorem 1]. To illustrate this, we show how to establish the continuity
of entropy directly for any distortion measure where $D \to D_{\min} \Rightarrow \mathrm{Var}[z] \to 0$. One example of such
a distortion measure is obtained if we choose $d(x, \hat{x}; q) = q \cdot |x - \hat{x}|^r$ with $r > 0$ and $\Pr[q = 0] = 0$.
Denoting $\mathrm{Var}[z|w]$ as $\sigma^2_{z|w}$ and $\mathrm{Var}[x|w]$ as $\sigma^2_{x|w}$ and letting $\mathcal{N}(\alpha)$ represent a Gaussian random variable
with variance $\alpha$ yields

\begin{align}
\limsup_{D \to D_{\min}} h(x + z|w) - h(x|w) &= \limsup_{\mathrm{Var}[z] \to 0} h(x + z|w) - h(x|w) \tag{73} \\
&= \limsup\, h(x + z|w) \pm h(\mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})|w) \pm h(\mathcal{N}(\sigma^2_{x|w})|w) - h(x|w) \tag{74} \\
&\le \limsup \left[ D(p_{x|w} \| \mathcal{N}(\sigma^2_{x|w})) - D(p_{x+z|w} \| \mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})) \right] \notag \\
&\quad + \limsup \left[ h(\mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})|w) - h(\mathcal{N}(\sigma^2_{x|w})|w) \right] \tag{75} \\
&\le \limsup \left[ h(\mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})|w) - h(\mathcal{N}(\sigma^2_{x|w})|w) \right] \tag{76} \\
&= \limsup \int \left[ h(\mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})|w = w) - h(\mathcal{N}(\sigma^2_{x|w})|w = w) \right] dp_w(w) \tag{77} \\
&= \int \limsup \left[ h(\mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})|w = w) - h(\mathcal{N}(\sigma^2_{x|w})|w = w) \right] dp_w(w) \tag{78} \\
&= 0. \tag{79}
\end{align}

We obtain (75) since for any random variable $v$, the relative entropy from $v$ to a Gaussian takes the
special form $D(p_v \| \mathcal{N}(\mathrm{Var}[v])) = h(\mathcal{N}(\mathrm{Var}[v])) - h(v)$ [32, Theorem 9.6.5]. To get (76) we use the
fact that relative entropy (and also conditional relative entropy) is lower semi-continuous [33]. This could
also be shown by applying Fatou's Lemma [34, p. 78] to get that if the sequences $p_1(x), p_2(x), \ldots$ and
$q_1(x), q_2(x), \ldots$ converge to $p(x)$ and $q(x)$ then

$$\liminf_{i \to \infty} \int p_i(x) \log[p_i(x)/q_i(x)] \ge \int p(x) \log[p(x)/q(x)].$$

Switching the limsup and integral in (78) is justified by Lebesgue's Dominated Convergence Theorem
[34, p. 78] since the integrand is bounded for all values of $w$. In general, this bound is obtained from
combining the technical condition requiring $h(x|w = w)$ to be finite with the entropy maximizing
distribution in (25) and the expected distortion constraint in (26) to bound $h(x + z|q = q)$. For scaled
quadratic distortions, $h(x + z|q = q)$ can be bounded above by the entropy of a Gaussian with the
appropriate variance. To obtain (79) we first note that $\mathrm{Var}[z] \to 0$ implies $\mathrm{Var}[z|w = w] \to 0$ except
possibly for a set of $w$ having measure zero. This set of measure zero can be ignored because the integrand
is finite for all $w$. Finally, for the set of $w$ where $\mathrm{Var}[z|w = w] \to 0$, the technical requirement that the
entropy maximizing distribution in (25) is continuous shows that the entropy difference (79) goes to zero
in the limit. ■

Proof of Theorem 4: When $* \in \{\text{ENC}, \text{BOTH}\}$ in (28), the encoder can simulate $w$ by generating it
from $(x, q)$. When $* \in \{\text{DEC}, \text{NONE}\}$, the encoder can still simulate $w$ correctly provided that $w$ and $q$ are
independent. Thus being provided with $w$ provides no advantage given the conditions of the theorem. ■

Proof of Theorem 5: We begin by showing

$$R_{[\text{Q-DEC-W-DEC}]}(D) = R_{[\text{Q-NONE-W-DEC}]}(D). \tag{80}$$

When side information $(q, w)$ is available only at the decoder, the optimal strategy is Wyner-Ziv encoding
[4]. Let us compute the optimal reconstruction function $v(\cdot, \cdot, \cdot)$, which maps an auxiliary random variable
$u$ and the side information $q$ and $w$ to a reconstruction of the source:

\begin{align}
v(u, q, w) &= \arg\min_{\hat{x}} E[d(x, \hat{x}; q)|q = q, w = w, u = u] \tag{81} \\
&= \arg\min_{\hat{x}} d_0(q) E[d_1(x, \hat{x})|q = q, w = w, u = u] \tag{82} \\
&= \arg\min_{\hat{x}} E[d_1(x, \hat{x})|q = q, w = w, u = u] \tag{83} \\
&= \arg\min_{\hat{x}} E[d_1(x, \hat{x})|w = w, u = u]. \tag{84}
\end{align}

We obtain (82) from the assumption that we have a separable distortion measure. To get (84) recall that
by assumption $q$ is statistically independent of $x$ given $w$ and also $q$ is statistically independent of $u$
since $u$ is generated at the encoder from $x$. Thus neither the optimal reconstruction function $v(\cdot, \cdot, \cdot)$ nor
the auxiliary random variable $u$ depend on $q$. This establishes (80).
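
The step from (82) to (83) only uses the fact that a positive scale factor does not move the minimizer. The following toy check is a sketch under assumptions of our own: the reconstruction alphabet is finite, the posterior on the source given the decoder's observations is a hypothetical three-point distribution, $d_1$ is squared error, and the scale $d_0(q)$ is strictly positive.

```python
def best_reconstruction(posterior, d1, d0_q):
    """Return argmin over xhat of d0(q) * E[d1(x, xhat)] under the given posterior on x."""
    def cost(xhat):
        return d0_q * sum(p * d1(x, xhat) for x, p in enumerate(posterior))
    return min(range(len(posterior)), key=cost)

def squared_error(x, xhat):
    return (x - xhat) ** 2

# Hypothetical posterior on x in {0, 1, 2} given (w, u); it does not depend on q.
posterior = [0.2, 0.5, 0.3]

# Several hypothetical positive values of the scale factor d0(q).
choices = {d0: best_reconstruction(posterior, squared_error, d0) for d0 in (0.5, 1.0, 10.0)}
```

Every value of the scale factor leads to the same reconstruction, which is why the decoder gains nothing from knowing q in this setting.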

To show that

$$R_{[\text{Q-DEC-W-NONE}]}(D) = R_{[\text{Q-NONE-W-NONE}]}(D) \tag{85}$$

we need $w$ and $q$ to be independent. When this is true, $w$ does not affect anything and the problem is
equivalent to when $w = 0$ and is available at the decoder. From (80) we see that providing $w = 0$ at the
decoder does not help and thus we establish (85). Note that this argument fails when $w$ and $q$ are not
independent since in that case Wyner-Ziv based on $q$ could be performed and there would be no $w$ at
the decoder to enable the argument in (81)-(84).

To show that

$$R_{[\text{Q-DEC-W-BOTH}]}(D) = R_{[\text{Q-NONE-W-BOTH}]}(D) \tag{86}$$

we note that in this scenario the encoder and decoder can design a different source coding system for
each value of $w$. The subsystem for a fixed value $w^*$ corresponds to source coding with distortion side
information at the decoder. Specifically, the source will have distribution $p_{x|w}(x|w^*)$, and the distortion
side information will have distribution $p_{q|w}(q|w^*)$. Thus the performance of each subsystem is given by
$R_{[\text{Q-DEC-W-NONE}]}(D)$, which we already showed is the same as $R_{[\text{Q-NONE-W-NONE}]}(D)$. This establishes
(86).

Finally, to show that

$$R_{[\text{Q-DEC-W-ENC}]}(D) = R_{[\text{Q-NONE-W-ENC}]}(D) \tag{87}$$

we require the assumption that $q$ and $w$ are independent. This assumption implies

$$R_{[\text{Q-DEC-W-ENC}]}(D) = R_{[\text{Q-DEC-W-NONE}]}(D) \tag{88}$$

since an encoder without $w$ could always generate a simulated $w$ with the correct distribution relative
to the other variables. The same argument implies

$$R_{[\text{Q-NONE-W-ENC}]}(D) = R_{[\text{Q-NONE-W-NONE}]}(D). \tag{89}$$

Combining (88) and (89) with (85) yields (87). ■

Proof of Theorem 6: First we establish the four rate-distortion function equalities implied by (30a).
Using Theorem 3 we have

\begin{align}
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-DEC}]}(D) &\le \tag{90} \\
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) & \tag{91} \\
&= 0. \tag{92}
\end{align}

Similarly,

\begin{align}
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-BOTH}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) &\le \tag{93} \\
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) & \tag{94} \\
&= 0. \tag{95}
\end{align}

To show that

$$\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-NONE}]}(D) - R_{[\text{Q-BOTH-W-NONE}]}(D) = 0 \tag{96}$$

we need $q$ and $w$ to be independent. When this is true, $w$ does not affect anything and the problem
is equivalent to when $w = 0$ and is available at the decoder, and (90)-(92) establishes (96). Without
independence this argument fails because we can no longer invoke Theorem 3 since there will be no $w$
to make $x$ and $q$ conditionally independent in (71).

To finish establishing (30a) we again require $q$ and $w$ to be independent to obtain

\begin{align}
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-ENC}]}(D) - R_{[\text{Q-BOTH-W-ENC}]}(D) &\le \tag{97} \\
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-NONE}]}(D) - R_{[\text{Q-BOTH-W-ENC}]}(D) &= \tag{98} \\
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-NONE}]}(D) - R_{[\text{Q-BOTH-W-NONE}]}(D) & \tag{99} \\
&= 0 \tag{100}
\end{align}

where (99) follows since the encoder can always simulate $w$ from $(x, q)$ and (100) follows from (96).

Next, we establish the four rate-distortion function equalities implied by (30b). Using Theorem 5 we
have

\begin{align}
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-ENC-W-BOTH}]}(D) &\le \tag{101} \\
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) & \tag{102} \\
&= 0. \tag{103}
\end{align}

Similarly,

\begin{align}
\lim_{D \to D_{\min}} R_{[\text{Q-BOTH-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) &\le \tag{104} \\
\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) & \tag{105} \\
&= 0. \tag{106}
\end{align}

To show that

$$\lim_{D \to D_{\min}} R_{[\text{Q-NONE-W-DEC}]}(D) - R_{[\text{Q-NONE-W-BOTH}]}(D) = 0 \tag{107}$$

we need $q$ and $w$ to be independent and we need the distortion measure to be of the form $d(x, \hat{x}; q) =
d_0(q) \cdot d_1(x - \hat{x})$. When this is true the two rate-distortion functions in (107) are equivalent to the Wyner-Ziv
rate-distortion function and the conditional rate-distortion function for the difference distortion measure
$E[d_0(q)] \cdot d_1(x - \hat{x})$. Thus we can either apply the result from [8] showing these rate-distortion functions
are equal in the high-resolution limit or simply specialize Theorem 3 to the case where $q$ is a constant.

To complete the proof, we again require the assumptions that $q$ and $w$ are independent and that the
distortion measure is of the form $d(x, \hat{x}; q) = d_0(q) \cdot d_1(x - \hat{x})$. We have

\begin{align}
\lim_{D \to D_{\min}} R_{[\text{Q-DEC-W-DEC}]}(D) - R_{[\text{Q-DEC-W-BOTH}]}(D) &\le \tag{108} \\
\lim_{D \to D_{\min}} R_{[\text{Q-NONE-W-DEC}]}(D) - R_{[\text{Q-DEC-W-BOTH}]}(D) &= \tag{109} \\
\lim_{D \to D_{\min}} R_{[\text{Q-NONE-W-DEC}]}(D) - R_{[\text{Q-NONE-W-BOTH}]}(D) & \tag{110} \\
&= 0. \tag{111}
\end{align}

where (110) follows from Theorem 5 and (111) follows from (107). ■

Proof of Theorem^ We note that according to Theorem 3 and Theorem 6 we can focus solely on
the case

$$R_{[\text{Q-}*\text{-W-NONE}]}(D) - R_{[\text{Q-}*\text{-W-BOTH}]}(D). \tag{112}$$

When $* = \text{NONE}$, the rate difference in (112) is the difference between the classical rate-distortion
function and the conditional rate-distortion function in the high-resolution limit. Thus the Shannon Lower
Bound [23] (and its conditional version) imply that

$$\lim_{D \to D_{\min}} R_{[\text{Q-NONE-W-NONE}]}(D) - R_{[\text{Q-NONE-W-BOTH}]}(D) = h(x) - h(x|w). \tag{113}$$

Similarly, when $* = \text{DEC}$ an identical argument can be combined with Theorem 5.

When $* = \text{BOTH}$, the encoder and decoder can design a separate compression sub-system for each
value of $q$. The rate-loss for each sub-system is then $I(x; w|q = q)$ according to high-resolution Wyner-Ziv
theory [8]. Averaging over all values of $q$ yields a total rate-loss of $I(x; w|q)$.

Next we consider the case when $* = \text{ENC}$ and the rate-loss penalty is

\begin{align}
&\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-NONE}]}(D) - R_{[\text{Q-ENC-W-BOTH}]}(D) \notag \\
&\quad = \lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-NONE}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) \tag{114}
\end{align}

where the equality follows from Theorem 6.

Using arguments similar to [23] and the proof of Theorem 3 we can obtain a Shannon Lower Bound
for $R_{[\text{Q-ENC-W-NONE}]}(D)$, which is of the form

$$R_{[\text{Q-ENC-W-NONE}]}(D) \ge h(x) - h(z_D) \tag{115}$$

where $z_D$ is an entropy maximizing random variable subject to the constraint that $E[d_0(q) \cdot d_1(z_D)] \le D$.
Again using arguments similar to the proof of Theorem 3 we have that

$$\lim_{D \to D_{\min}} R_{[\text{Q-BOTH-W-BOTH}]}(D) \le h(x|w) - h(z_D). \tag{116}$$

Combining (115) and (116) shows that the asymptotic difference in (114) is at least $I(x; w)$.
Next, we obtain the Shannon Lower Bound

$$R_{[\text{Q-BOTH-W-BOTH}]}(D) \ge h(x|w) - h(z_D) \tag{117}$$

by duplicating the arguments in the proof of Theorem 5 since this lower bound does not require $q$ and
$w$ to be independent. Finally, we can obtain the upper bound

$$\lim_{D \to D_{\min}} R_{[\text{Q-ENC-W-NONE}]}(D) \le h(x) - h(z_D) \tag{118}$$

using an additive noise test channel combined with arguments following those in the proof of Theorem 3.
Combining (117) and (118) shows that the asymptotic difference in (114) is at most $I(x; w)$. ■

Proof of Theorem^ To simplify the exposition, we first prove the theorem for the relatively simple
case of a one-dimensional source ($k = 1$) with a quadratic distortion ($r = 2$). Then at the end of the
proof, we describe how to extend it to general $k$ and $r$.

We begin with the case where $* = \text{NONE}$. Since Theorems 5 and 6 imply

$$R_{[\text{Q-NONE-W-NONE}]}(D) = R_{[\text{Q-DEC-W-NONE}]}(D) \tag{119a}$$

and

$$R_{[\text{Q-ENC-W-NONE}]}(D) \to R_{[\text{Q-BOTH-W-NONE}]}(D), \tag{119b}$$

we focus on showing

$$\lim_{D \to D_{\min}} R_{[\text{Q-BOTH-W-NONE}]}(D) - R_{[\text{Q-NONE-W-NONE}]}(D) = -\frac{1}{2} E\left[\log \frac{E[q]}{q}\right]. \tag{120}$$

Computing $R_{[\text{Q-BOTH-W-NONE}]}(D)$ is equivalent to finding the rate-distortion function for optimally
encoding independent random variables and yields the familiar "water-pouring" rate and distortion
allocation [32, Section 13.3.3]. For each $q$, we quantize the corresponding source samples with distortion
$D_q = E[(x - \hat{x})^2]$ (or $E[\|x - \hat{x}\|^r]$ in the more general case) and rate $R_q(D_q)$. The overall rate and
distortion then become $E[R_q(D_q)]$ and $E[q \cdot D_q]$.

Thus to find the rate and distortion allocation we set up a constrained optimization problem using
Lagrange multipliers to obtain the functional

$$J(D) = E[R_q(D_q)] + \lambda (D - E[q \cdot D_q]), \tag{121}$$

differentiate with respect to $D_q$, set equal to zero and solve for each $D_q$. In the high-resolution limit,
various researchers have shown

$$R_q(D_q) \to h(x) - \frac{1}{2} \log D_q \tag{122}$$

(e.g., see [23] and references therein). Therefore, it is straightforward to show this process yields the
condition $D_q = 1/(2 \lambda q)$ with $2\lambda = 1/D$ implying

$$\lim_{D \to D_{\min}} R_{[\text{Q-BOTH-W-NONE}]}(D) \to h(x) - \frac{1}{2} \log D + \frac{1}{2} E[\log q]. \tag{123}$$

To compute $R_{[\text{Q-NONE-W-NONE}]}(D)$, we note that since neither encoder nor decoder knows $q$ the optimal
strategy is to simply quantize the source according to the distortion $d(q, x; \hat{x}) = E[q] \cdot (x - \hat{x})^2$ to obtain

$$\lim_{D \to D_{\min}} R_{[\text{Q-NONE-W-NONE}]}(D) \to h(x) - \frac{1}{2} \log D + \frac{1}{2} \log E[q]. \tag{124}$$

Comparing (123) with (124) establishes (120).
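
As a worked instance of this comparison, the sketch below evaluates the limiting rate penalty $\frac{1}{2}(\log E[q] - E[\log q])$ for lacking the distortion side information, together with the per-value distortion allocation $D_q = D/q$ implied by the Lagrangian condition. The function names are ours, logarithms are taken base 2 so that the result is in bits, and the two-point distribution (q = 1 with probability 0.6, q = 10 with probability 0.4) is the one used for the numerical example discussed before Section VII.

```python
import math

def high_res_rate_gap_bits(q_values, q_probs):
    """0.5 * (log2 E[q] - E[log2 q]): the high-resolution penalty in bits per sample."""
    mean_q = sum(p * q for q, p in zip(q_values, q_probs))
    mean_log_q = sum(p * math.log2(q) for q, p in zip(q_values, q_probs))
    return 0.5 * (math.log2(mean_q) - mean_log_q)

def distortion_allocation(D, q_values):
    """Per-value distortions D_q = D / q, so that E[q * D_q] = D."""
    return [D / q for q in q_values]

q_values, q_probs = [1.0, 10.0], [0.6, 0.4]
gap = high_res_rate_gap_bits(q_values, q_probs)
allocation = distortion_allocation(0.01, q_values)
```

For this distribution the penalty is roughly 0.44 bit per sample, and by Jensen's inequality it is never negative; it vanishes exactly when q is a constant.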

By applying Theorem 4 we see that the case where $* = \text{ENC}$ is the same as $* = \text{NONE}$.

Next we consider the case where $* = \text{BOTH}$ in (34). In this case, the encoder and decoder can design
a separate compression sub-system for each value of $w$ and the performance for each sub-system is
obtained from the case with no signal side information. Specifically, the rate-loss for each sub-system is

$$\frac{1}{2} E\left[\log \frac{E[q|w = w]}{q} \,\middle|\, w = w\right] \tag{125}$$

according to the previously derived results. Averaging (125) over $w$ then yields the rate-loss in (34).

Finally, we consider the case where $* = \text{DEC}$ in (34). Since Theorem 5 implies $R_{[\text{Q-DEC-W-DEC}]}(D) =
R_{[\text{Q-NONE-W-DEC}]}(D)$ and Theorem 3 implies $R_{[\text{Q-ENC-W-DEC}]}(D) \to R_{[\text{Q-BOTH-W-BOTH}]}(D)$, it suffices to
show that

$$\lim_{D \to D_{\min}} R_{[\text{Q-DEC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) = \frac{1}{2} E\left[\log \frac{E[q]}{q}\right]. \tag{126}$$

We can compute $R_{[\text{Q-BOTH-W-BOTH}]}(D)$ by considering a separate coding system for each value of $w$.
Specifically, conditioned on $w = w$, computing the rate-distortion trade-off is equivalent to finding
$R_{[\text{Q-BOTH-W-NONE}]}(D)$ for a modified source $x'$ with distribution $p_{x'}(x') = p_{x|w}(x'|w)$. Thus we obtain

$$\lim_{D \to D_{\min}} R_{[\text{Q-BOTH-W-BOTH}]}(D) \to h(x|w) - \frac{1}{2} \log D + \frac{1}{2} E[\log q]. \tag{127}$$

Applying the standard techniques used throughout the paper, we can compute the Shannon Lower Bound

$$R_{[\text{Q-DEC-W-DEC}]}(D) \ge h(x|w) - \frac{1}{2} \log(D / E[q]) \tag{128}$$

and show it is tight in the high-resolution limit. Comparing (127) and (128) establishes the desired result.

This establishes the theorem for $k = 1$ and $r = 2$. For general $k$ and $r$, the only change is that each
component rate-distortion function $R_q(D_q)$ in (122) becomes [23, page 2028]

$$R_q(D_q) \to h(x) - \frac{k}{r} \log D_q - \frac{k}{r} + \log\left[\frac{r}{k V_k \Gamma(k/r)} \left(\frac{k}{r}\right)^{k/r}\right] \tag{129}$$

and a similar change occurs for all the following rate-distortion expressions. Since we are mainly interested
in the difference of rate-distortion functions, most of these extra terms cancel out and the only change
is that factors of 1/2 are replaced with factors of $k/r$. ■

D. Proofs for Non-Asymptotic Bounds 

Before proceeding, we require the following lemma to upper and lower bound the entropy of an
arbitrary random variable plus a Gaussian mixture.

Lemma 2 Let $x$ be an arbitrary random variable with finite variance $\sigma^2 < \infty$. Let $z$ be a zero-mean,
unit-variance Gaussian independent of $x$ and let $v$ be a random variable independent of $x$ and $z$ with
$0 < v_{\min} \le v \le v_{\max}$. Then

$$h(x) + \frac{1}{2} \log\left(1 + \frac{v_{\min}}{\sigma^2}\right) \le h(x + z\sqrt{v}) \le h(x) + \frac{1}{2} \log(1 + v_{\max} \cdot J(x)) \tag{130}$$

with equality if and only if $v$ is a constant and $x$ is Gaussian.

Proof: The concavity of differential entropy yields

$$h(x + z\sqrt{v_{\min}}) \le h(x + z\sqrt{v}) \le h(x + z\sqrt{v_{\max}}). \tag{131}$$

For the lower bound we have

\begin{align}
h(x + z\sqrt{v_{\min}}) &= \int_0^{v_{\min}} \frac{\partial}{\partial \tau} h(x + z\sqrt{\tau})\, d\tau + h(x) \tag{132} \\
&= \int_0^{v_{\min}} \frac{1}{2} J(x + z\sqrt{\tau})\, d\tau + h(x) \tag{133} \\
&\ge \frac{1}{2} \int_0^{v_{\min}} J(z\sqrt{\sigma^2 + \tau})\, d\tau + h(x) \tag{134} \\
&= \frac{1}{2} \int_0^{v_{\min}} \frac{d\tau}{\sigma^2 + \tau} + h(x) \tag{135} \\
&= \frac{1}{2} \log\left(1 + \frac{v_{\min}}{\sigma^2}\right) + h(x) \tag{136}
\end{align}

where (133) follows from de Bruijn's identity [32, Theorem 16.6.2], [35, Theorem 14], (134) follows
from the fact that a Gaussian distribution minimizes Fisher Information subject to a variance constraint,
and (135) follows since the Fisher Information for a Gaussian is the reciprocal of its variance.

Similarly, for the upper bound we have

\begin{align}
h(x + z\sqrt{v_{\max}}) &= \int_0^{v_{\max}} \frac{\partial}{\partial \tau} h(x + z\sqrt{\tau})\, d\tau + h(x) \tag{137} \\
&= \int_0^{v_{\max}} \frac{1}{2} J(x + z\sqrt{\tau})\, d\tau + h(x) \tag{138} \\
&\le \frac{1}{2} \int_0^{v_{\max}} \frac{J(x) J(z\sqrt{\tau})}{J(x) + J(z\sqrt{\tau})}\, d\tau + h(x) \tag{139} \\
&= \frac{1}{2} \int_0^{v_{\max}} \frac{J(x)}{\tau J(x) + 1}\, d\tau + h(x) \tag{140} \\
&= \frac{1}{2} \log(1 + v_{\max} \cdot J(x)) + h(x) \tag{141}
\end{align}

where (138) again follows from de Bruijn's identity, (139) follows from the convolution inequality for
Fisher Information [36], [32, p. 497], and (140) follows since the Fisher Information for a Gaussian is
the reciprocal of its variance.

Combining these upper and lower bounds yields the desired result. Finally, the inequalities used in
(134) and (139) are both tight if and only if $x$ is Gaussian. ■
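
The bounds of Lemma 2 are easy to check numerically when x itself is Gaussian, since x plus a Gaussian scale mixture is then again a Gaussian scale mixture. The sketch below is an illustration under our own assumptions: x is Gaussian with variance 1 (so that J(x) = 1), v takes the hypothetical values 0.5 and 2 with equal probability, entropies are in nats, and the mixture entropy is computed by plain trapezoidal integration rather than by any closed form.

```python
import math

def gaussian_entropy(var):
    """Differential entropy (nats) of a Gaussian with variance var."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def scale_mixture_entropy(variances, probs, num_points=20001, width=12.0):
    """Entropy (nats) of a zero-mean Gaussian scale mixture via trapezoidal integration."""
    limit = width * math.sqrt(max(variances))
    step = 2 * limit / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        y = -limit + i * step
        f = sum(p * math.exp(-y * y / (2 * s)) / math.sqrt(2 * math.pi * s)
                for s, p in zip(variances, probs))
        term = -f * math.log(f) if f > 0.0 else 0.0
        weight = 0.5 if i in (0, num_points - 1) else 1.0
        total += weight * term
    return total * step

sigma2 = 1.0                       # variance of the Gaussian x, so J(x) = 1 / sigma2
v_values, v_probs = [0.5, 2.0], [0.5, 0.5]

h_x = gaussian_entropy(sigma2)
lower = h_x + 0.5 * math.log(1 + min(v_values) / sigma2)
upper = h_x + 0.5 * math.log(1 + max(v_values) * (1 / sigma2))
h_sum = scale_mixture_entropy([sigma2 + v for v in v_values], v_probs)
```

Here `h_sum` falls strictly between `lower` and `upper`, as the lemma requires when v is not a constant; with a single value of v the two bounds and the exact entropy coincide.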

As an aside we note that Lemma 2 can be used to bound the rate-distortion function of an arbitrary
unit-variance source $x$ relative to quadratic distortion. Specifically using an additive Gaussian noise
test-channel $\hat{x} = z + x$ and combining Lemma 2 to upper bound $h(x + z)$ with the Shannon Lower Bound
[23] yields

$$h(x) - \frac{1}{2} \log 2\pi e D \le R(D) \le h(x) - \frac{1}{2} \log 2\pi e D + \frac{1}{2} \log[1 + D J(x)]. \tag{142}$$

Evidently, the error in the Shannon Lower Bound is at most $\frac{1}{2} \log[1 + D J(x)]$. Thus, since $J(x) \ge 1$
with equality only for a Gaussian, the sub-optimality of an additive Gaussian noise test-channel is at
least $\frac{1}{2} \log[1 + D]$.

Proof of Theorem 9: Starting with the bound for the rate gap from (71), we have

\begin{align}
R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D) &\le h(x + z|w) - h(x|w) \tag{143} \\
&= \int [h(x + z|w = w) - h(x|w = w)]\, p_w(w)\, dw \tag{144} \\
&\le \int \frac{1}{2} \log\left(1 + \min\left[1, \frac{D}{q_{\min}}\right] J(x|w = w)\right) p_w(w)\, dw \tag{145} \\
&\le \frac{1}{2} \log\left(1 + \min\left[1, \frac{D}{q_{\min}}\right] \int J(x|w = w)\, p_w(w)\, dw\right) \tag{146} \\
&= \frac{1}{2} \log\left(1 + J(x|w) \min\left[1, \frac{D}{q_{\min}}\right]\right). \tag{147}
\end{align}

To obtain (145) we note that $z$ is a Gaussian mixture and apply Lemma 2. This follows since, conditioned
on $q = q$, $z$ is a Gaussian with variance $E[d(x, \hat{x}^*_w; q)]$ where $\hat{x}^*_w$ was defined in the proof of Theorem 3
to be the optimal distribution when both encoder and decoder know the side information. By considering
the optimal "water-pouring" distortion allocation for the optimal test-channel distribution $\hat{x}^*_w$, it can be
demonstrated that if the distortion is $D$, then $E[d(x, \hat{x}^*_w; q)]$ is at most $\min[1, D/q]$ for each $q$. ■

To develop a similar bound for other distortion measures essentially all we need is an upper bound
for the derivative of $h(x + \sqrt{\tau} z)$ with respect to $\tau$. Since entropy is concave, if we can compute this
derivative for $\tau = 0$ then it will be an upper bound for the derivative at all $\tau$.

To obtain the desired derivative at $\tau = 0$, we can write

$$h(x + \sqrt{\tau} z) = I(x + \sqrt{\tau} z; \sqrt{\tau} z) + h(x). \tag{148}$$

The results of Prelov and van der Meulen [37] imply that under certain regularity conditions

$$\lim_{\tau \to 0} \frac{\partial}{\partial \tau} I(x + \sqrt{\tau} z; \sqrt{\tau} z) = J(x)/2, \tag{149}$$

which provides the desired derivative. Similarly if we rewrite the mutual information in (148) as a relative
entropy, then a Taylor series expansion of the relative entropy [38, 2.6] can be used to establish (149)
provided certain derivatives of the probability distributions exist.

Next, we move to proving Theorem 7. An essential part of our proof is an alternative version of the
Shannon Lower Bound, which we develop in the following lemma.
Lemma 3 (Alternative Shannon Lower Bound) Consider a scaled quadratic distortion measure of the
form $d(x, \hat{x}; q) = q \cdot (x - \hat{x})^2$ and let $\hat{x}^*$ denote an optimal test-channel distribution when q and w
are known at both encoder and decoder. If we define z to have the same distribution as $x - \hat{x}^*$ when
conditioned on q and furthermore require z to satisfy the Markov condition $z \leftrightarrow q \leftrightarrow w, x$, then

\begin{equation}
R_{[\text{Q-BOTH-W-BOTH}]}(D) \ge h(x | w) - h(z | q). \tag{150}
\end{equation}

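Proof:

\begin{align}
R_{[\text{Q-BOTH-W-BOTH}]}(D) &= I(\hat{x}^*; x | q, w) \tag{151} \\
&= h(x | q, w) - h(x | q, w, \hat{x}^*) \tag{152} \\
&= h(x | q, w) - h(x - \hat{x}^* | q, w, \hat{x}^*) \tag{153} \\
&= h(x | q, w) - h(z | q, w, \hat{x}^*) \tag{154} \\
&\ge h(x | q, w) - h(z | q, w). \tag{155}
\end{align}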

■ 
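The last inequality, and the passage from (155) to the form stated in (150), can be spelled out as follows. This is a sketch that assumes q is conditionally independent of x given w, the same simplification that is used in passing from (162) to (163); that assumption is not restated in the lemma itself. Here $\hat{x}^*$ denotes the optimal test-channel distribution of Lemma 3.

```latex
\begin{align*}
h(z \mid q, w, \hat{x}^*) &\le h(z \mid q, w)
  && \text{(conditioning cannot increase differential entropy)} \\
h(z \mid q, w) &= h(z \mid q)
  && \text{(Markov condition } z \leftrightarrow q \leftrightarrow w, x \text{)} \\
h(x \mid q, w) &= h(x \mid w)
  && \text{(assumed: } q \text{ is independent of } x \text{ given } w \text{)}
\end{align*}
```

Substituting the last two lines into (155) gives $h(x \mid w) - h(z \mid q)$, which is the right-hand side of (150).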

The key difference between Lemma 3 and the traditional Shannon Lower Bound (SLB) is in the choice
of the distribution for z. The traditional SLB uses an entropy maximizing distribution for z, which has
the advantage of being computable without knowing $\hat{x}^*$. The trouble with the entropy maximizing
distribution is that it can have an unbounded variance for large distortions. As we show in the following
lemma, however, the alternative SLB keeps the variance of z bounded.

Lemma 4 There exists a choice for z in Lemma 3 such that for all values of w,

\begin{equation}
\mathrm{Var}[z | w = w] \le \mathrm{Var}[x | w = w]. \tag{156}
\end{equation}

Proof: Imagine that we choose some optimal test-channel distribution $\hat{x}^*$ such that the resulting z
does not satisfy (156) for some value of w. We will show that it is possible to construct an alternative
optimal test-channel distribution $\hat{x}^{*\prime}$ where the resulting z' does satisfy (156) for w = w.
Specifically, if (156) is not satisfied, then it must be that there exists a set A with

\begin{equation}
\mathrm{Var}[z | q = q, w = w] = \mathrm{Var}[x - \hat{x}^* | q = q, w = w] > \mathrm{Var}[x | w = w], \quad \forall (q, w) \in A. \tag{157}
\end{equation}

Define a new random variable $\hat{x}^{*\prime}$ such that $\hat{x}^{*\prime} = \hat{x}^*$ for all $(q, w) \notin A$, but with $\hat{x}^{*\prime} = E[x | w = w]$ for all
$(q, w) \in A$. The distortion is lower for $\hat{x}^{*\prime}$ by construction. Furthermore, the data processing inequality
implies that

\begin{equation}
I(\hat{x}^{*\prime}; x | w, q) \le I(\hat{x}^*; x | w, q) \tag{158}
\end{equation}

and so the rate is lower too. Thus if we define $z' = \hat{x}^{*\prime} - x$ analogously to how we defined z, then
condition (156) is satisfied with z replaced by z'. ■
Proof of Theorem 7: Using the alternative SLB from Lemma 3 and the test-channel distribution
$\hat{x} = x + z$ with z chosen according to Lemma 4, we obtain

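\begin{align}
R_{[\text{Q-ENC-W-DEC}]}(D) - R_{[\text{Q-BOTH-W-BOTH}]}(D)
&\le I(x + z; q, x | w) - [h(x | w, q) - h(z | w, q)] \tag{159} \\
&= h(x + z | w) - h(x + z | q, x, w) - h(x | w, q) + h(z | w, q) \tag{160} \\
&= h(x + z | w) - h(z | q, x, w) - h(x | w, q) + h(z | w, q) \tag{161} \\
&= h(x + z | w) - h(z | q) - h(x | w, q) + h(z | q) \tag{162} \\
&= h(x + z | w) - h(x | w) \tag{163} \\
&= D(p_{x|w} \| \mathcal{N}(\sigma^2_{x|w})) - D(p_{x+z|w} \| \mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})) \notag \\
&\quad + h(\mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w}) | w) - h(\mathcal{N}(\sigma^2_{x|w}) | w) \tag{164} \\
&\le D(p_{x|w} \| \mathcal{N}(\sigma^2_{x|w})) + \int \frac{1}{2} \log\left(1 + \frac{\sigma^2_{z|w}}{\sigma^2_{x|w}}\right) p_w(w) \, dw \tag{165} \\
&\le D(p_{x|w} \| \mathcal{N}(\sigma^2_{x|w})) + \int \frac{1}{2} \log(1 + \cdots) \, p_w(w) \, dw \tag{168} \\
&\le D(p_{x|w} \| \mathcal{N}(\sigma^2_{x|w})) + \int \frac{1}{2} \log(1 + \cdots) \, p_w(w) \, dw \tag{169} \\
&= D(p_{x|w} \| \mathcal{N}(\sigma^2_{x|w})) + \frac{1}{2} \log[1 + \cdots]. \tag{170}
\end{align}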



To obtain (163)-(167) we use the same arguments as in (73)-(79) plus the additional observation that
relative entropy is non-negative and can be dropped in obtaining (165). Next, we apply Lemma 4 to keep
the variance of the test-channel noise to be at most $\sigma^2_{x|w}$ to get (168). Finally, the assumption that
$\sigma^2_{x|w} \ge \sigma^2_{\min}$ yields (170).
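For reference, the logarithmic terms in the chain above arise from the difference of two Gaussian entropies. The following worked step is a standard identity, written here for a fixed value of w with $\sigma^2_{x|w}$ and $\sigma^2_{z|w}$ taken to denote the conditional variances of x and z; it is a sketch of the intermediate algebra rather than a quotation from the original derivation.

```latex
\begin{align*}
h(\mathcal{N}(\sigma^2_{x|w} + \sigma^2_{z|w})) - h(\mathcal{N}(\sigma^2_{x|w}))
&= \tfrac{1}{2} \log\left(2\pi e \, (\sigma^2_{x|w} + \sigma^2_{z|w})\right)
 - \tfrac{1}{2} \log\left(2\pi e \, \sigma^2_{x|w}\right) \\
&= \tfrac{1}{2} \log\left(1 + \frac{\sigma^2_{z|w}}{\sigma^2_{x|w}}\right).
\end{align*}
```

Averaging this expression over $p_w(w)$ produces the integral of $\tfrac{1}{2}\log(1 + \sigma^2_{z|w}/\sigma^2_{x|w})$ that appears once the relative entropy term has been dropped.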

To develop a similar bound for other distortion measures, we would use an entropy maximizing
distribution for the appropriate distortion measure in $D(p_{x|w} \| \cdot)$ and $D(p_{x+z|w} \| \cdot)$ above.